Ask HN: What software can track errors and group them by root cause?

1 point by theamk, 9 days ago
I work on a team that maintains an internal batch processing system. To keep service quality high, we centrally record all failures/errors, look at every one of them, and assign each to a root cause ticket. A frequent failure gets fixed ASAP; a sporadic, once-per-week failure gets prioritized and put in the next sprint. Sometimes a service breaks and there are dozens of failures (usually binned to one root cause ticket), but most of the time there is less than one failure per day.

Unfortunately, we have no good way to manage the failures. We currently use custom scripts + JIRA, and it does not work very well. We are happy to pay for an external service, but I simply cannot find anything!

Things like Datadog or Sentry deal in statistics and error groups... but we want to look at every failure to make sure nothing slips through the cracks. JIRA is too slow and limited. We even tried Google Sheets, but it does not scale.

Does anyone have a similar problem: tracking each individual failure, not just aggregates/counters? What do you use?
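
For illustration only, the workflow described above (record every failure, bin each one to a root cause ticket, make sure none is left unassigned) could be modelled roughly as below. This is a sketch, not the actual scripts: the `failures` table, its columns, the function names, and the SQLite storage are all hypothetical choices made for the example.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a hypothetical failure store; in-memory by default."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS failures (
               id          INTEGER PRIMARY KEY,
               occurred_at TEXT NOT NULL,   -- ISO 8601 timestamp
               job         TEXT NOT NULL,   -- which batch job failed
               message     TEXT NOT NULL,   -- error text
               ticket      TEXT             -- root cause ticket, NULL until triaged
           )"""
    )
    return db


def record_failure(db, occurred_at, job, message):
    """Store one individual failure and return its row id."""
    cur = db.execute(
        "INSERT INTO failures (occurred_at, job, message) VALUES (?, ?, ?)",
        (occurred_at, job, message),
    )
    db.commit()
    return cur.lastrowid


def assign(db, failure_ids, ticket):
    """Bin one or more failures to a single root cause ticket."""
    db.executemany(
        "UPDATE failures SET ticket = ? WHERE id = ?",
        [(ticket, fid) for fid in failure_ids],
    )
    db.commit()


def untriaged(db):
    """Every failure nobody has looked at yet, oldest first."""
    return db.execute(
        "SELECT id, occurred_at, job, message FROM failures "
        "WHERE ticket IS NULL ORDER BY occurred_at, id"
    ).fetchall()


def failures_per_ticket(db):
    """Failure count per root cause ticket, most frequent first."""
    return db.execute(
        "SELECT ticket, COUNT(*) FROM failures WHERE ticket IS NOT NULL "
        "GROUP BY ticket ORDER BY COUNT(*) DESC, ticket"
    ).fetchall()
```

The point of the nullable `ticket` column is that "nothing slips through the cracks" becomes a single query: the `untriaged` list should be empty after each review, while `failures_per_ticket` gives the frequency used to decide between "fix ASAP" and "next sprint".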