Ask HN: As a developer, am I wrong to think monitoring alerts are mostly noise?
3 points • by yansoki • 8 months ago
I'm a solo developer working on a new tool, and I need a reality check from the ops and infrastructure experts here.
My background is in software development, not SRE. From my perspective, the monitoring alerts that bubble up from our infrastructure have always felt like a massive distraction. I'll get a page for "High CPU" on a service, spend an hour digging through logs and dashboards, only to find out it was just a temporary traffic spike and not a real issue. It feels like a huge waste of developer time.
My hypothesis is that the tools we use are too focused on static thresholds (e.g., "CPU > 80%") and lack the context to tell us what's actually an anomaly. I've been exploring a different approach based on peer-group comparisons (e.g., is api-server-5 behaving differently from its peers api-server-1 through 4?).
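To make the peer-group idea concrete, here is a minimal sketch of one way it could work. It is not the tool described in this post: the function name `peer_outliers`, the `cutoff` and `min_mad` parameters, and the sample CPU numbers are all assumptions made up for illustration. It scores each host by how far its value sits from the group median, scaled by the median absolute deviation (a modified z-score), instead of comparing against a fixed threshold.

```python
from statistics import median


def peer_outliers(values_by_host, cutoff=3.5, min_mad=1.0):
    """Return the hosts whose value deviates strongly from the peer-group median.

    values_by_host: mapping of host name -> metric value (e.g. CPU percent).
    cutoff: modified z-score above which a host counts as an outlier (assumed default).
    min_mad: floor for the spread, so near-identical peers do not cause
             a division by zero or flag trivial differences (assumed default).
    """
    values = list(values_by_host.values())
    center = median(values)
    # Median absolute deviation: a robust spread estimate that a single
    # misbehaving host cannot inflate the way it would a standard deviation.
    mad = median(abs(v - center) for v in values)
    spread = max(mad, min_mad)
    return sorted(
        host
        for host, v in values_by_host.items()
        if 0.6745 * abs(v - center) / spread > cutoff
    )


# Hypothetical traffic spike: every server is busy, so nobody stands out,
# even though a static "CPU > 80%" rule would fire for all five.
spike = {"api-server-1": 88, "api-server-2": 86, "api-server-3": 90,
         "api-server-4": 87, "api-server-5": 89}

# Hypothetical single-host problem: only api-server-5 differs from its peers.
one_bad = {"api-server-1": 35, "api-server-2": 38, "api-server-3": 33,
           "api-server-4": 36, "api-server-5": 91}

print(peer_outliers(spike))    # []
print(peer_outliers(one_bad))  # ['api-server-5']
```

One trade-off worth noting: a check like this stays silent when the whole group degrades together, so it would probably have to complement threshold or error-rate alerts rather than replace them.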
But I'm coming at this from a dev perspective and I'm very aware that I might be missing the bigger picture. I'd love to learn from the people who live and breathe this stuff.
How much developer time is lost at your company to investigating "false positive" infrastructure alerts?
Do you think the current tools (Datadog, Prometheus, etc.) create a significant burden for dev teams?
Is the idea of "peer-group context" a sensible direction, or are there better ways to solve this that I'm not seeing?
I haven't built much yet because I'm committed to solving a real problem. Any brutal feedback or insights would be incredibly valuable.