Hacker News (Chinese edition)



I’m trying to understand how teams actually debug production issues in systems made up of multiple services and external integrations (e.g. Stripe, Twilio, internal microservices, queues, webhooks, etc.).

In practice, when something breaks, it seems like the workflow is usually:

* an alert fires (Datadog/Sentry/CloudWatch/etc.)
* or a customer complains
* engineers then start checking logs, traces, dashboards across multiple systems
* and eventually manually reconstruct what happened across services

What I’m curious about:

* How do you actually trace a single failed request or transaction across multiple services today?
* What tools do you rely on most in practice (not in theory)?
* Where does it usually break down: logs, tracing, instrumentation, or just missing context?
* How long does it typically take to go from “something is wrong” → “we know exactly why it broke”?
* What part of this is still mostly manual stitching together of information?

Trying to understand what the real pain points are in practice, especially in systems with lots of external integrations and async flows.
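To make the "manual stitching" step concrete, here is a minimal sketch of what I mean. It assumes every service emits JSON log lines that carry a shared correlation ID. The field names (`correlation_id`, `ts`, `service`, `msg`), the function name, and the sample log lines are all hypothetical, not taken from any particular tool or real system.

```python
import json
from collections import defaultdict


def stitch_by_correlation_id(log_lines):
    """Group JSON log lines from many services into per-request timelines.

    Assumes (hypothetically) that each event has a 'correlation_id' field and
    an ISO 8601 'ts' field in a single consistent format, so that sorting the
    timestamp strings also sorts the events chronologically.
    """
    timelines = defaultdict(list)
    for line in log_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a structured log line, skip it
        if not isinstance(event, dict):
            continue  # valid JSON but not an event object
        cid = event.get("correlation_id")
        if cid is None:
            continue  # missing context: cannot be tied to any request
        timelines[cid].append(event)
    for events in timelines.values():
        events.sort(key=lambda e: e.get("ts", ""))
    return dict(timelines)


# Made-up log lines, as if pulled from three separate systems.
SAMPLE_LOGS = [
    '{"ts": "2024-01-01T10:00:02Z", "service": "worker", "correlation_id": "req-1", "msg": "job failed"}',
    '{"ts": "2024-01-01T10:00:00Z", "service": "api", "correlation_id": "req-1", "msg": "request received"}',
    '{"ts": "2024-01-01T10:00:01Z", "service": "queue", "correlation_id": "req-1", "msg": "job enqueued"}',
    '{"ts": "2024-01-01T10:00:03Z", "service": "api", "correlation_id": "req-2", "msg": "request received"}',
    '{"ts": "2024-01-01T10:00:04Z", "service": "webhook", "msg": "callback with no correlation id"}',
    "plain text line that is not JSON",
]

if __name__ == "__main__":
    for cid, events in stitch_by_correlation_id(SAMPLE_LOGS).items():
        print(cid, [f'{e["service"]}: {e["msg"]}' for e in events])
```

The webhook event in the sample is silently dropped because it has no correlation ID, which is the "missing context" failure mode I am asking about: external integrations and async hops are where the ID tends to get lost.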

Ask HN: Debugging failures in large interconnected backend systems