Ask HN: Anyone losing sleep over retry storms or partial API outages?
2 points by rjpruitt16 2 days ago
I'm building infrastructure to address retry storms and outages. Before I go further, I want to understand what people are actually doing today, compare approaches, and maybe help someone spot a solution they hadn't considered.
The problems:
* **Retry storms** - An API fails, your entire fleet retries independently, and the thundering herd makes things worse (a sketch of the usual per-client mitigation follows this list).
* **Partial outages** - The API is "up" but degraded (slow, intermittent 500s). Health checks pass, but requests suffer.
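For concreteness, the baseline I'm comparing against is per-client capped exponential backoff with full jitter. A minimal Python sketch; `call_api` and the parameter values are placeholders, not any particular library's API:

```python
import random
import time

def call_with_backoff(call_api, max_attempts=5, base=0.1, cap=10.0):
    """Retry call_api with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return call_api()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the failure
            # Full jitter: sleep a random amount up to the capped exponential,
            # so a fleet of clients spreads its retries out instead of
            # retrying in lockstep.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Even with jitter, every client is still deciding to retry on its own, which is the part I'm trying to move past.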
What I'm curious about:
* What's your current solution? (Circuit breakers, queues, custom coordination, a service mesh, something else? A rough circuit-breaker sketch follows these questions.)
* How well does it work? What are the gaps?
* What scale are you at? (company size, number of instances, requests/sec)
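For reference, when I say circuit breakers I mean roughly this pattern. A hand-rolled, single-process Python sketch; the class and parameter names are illustrative, not a real library's API:

```python
import time

class CircuitOpen(Exception):
    """Raised when the breaker is open and we fail fast instead of calling."""

class CircuitBreaker:
    # Minimal, not thread-safe: after `threshold` consecutive failures the
    # breaker opens and fails fast for `reset_after` seconds, then lets one
    # trial call through (half-open) before closing again.
    def __init__(self, threshold=5, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise CircuitOpen("failing fast")  # don't hammer a struggling API
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # open (or re-open) the breaker
            raise
        self.failures = 0  # success closes the breaker
        return result
```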
I'd love to hear what's working, what isn't, and what you wish existed.