Production data access for L3 incidents is broken

1 point by addieg 6 months ago
It's 2 AM, your biggest customer is down, and support knows exactly which database query will solve it.

But first: write a sanitization script, get legal approval, provision replica access, then wait on a replica that lags 20 minutes behind production and takes 8 minutes per query.

Three hours later, support finally starts debugging. The actual fix takes 30 minutes. Your SLA is already blown.

You're either the engineer frantically writing data masking scripts while a customer bleeds money, or the support person who knows the exact query to run but can't touch production.

We spend more time getting access to the data than actually fixing the problem. The whole system is backwards.

Anyone else dealing with this madness, or have you found a way that doesn't suck?