Ask HN: Anyone losing sleep over retry storms or partial API outages?
2 points by rjpruitt16 2 days ago
I'm building infrastructure to address retry storms and outages. Before I go further, I want to understand what people are actually doing today, compare approaches, and maybe help someone spot a solution they hadn't considered.
The problems:
* **Retry storms** - An API fails, your entire fleet retries independently, and the thundering herd makes things worse (a sketch of the usual per-client mitigation follows this list).
* **Partial outages** - The API is "up" but degraded (slow, intermittent 500s). Health checks pass, but requests suffer.
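For concreteness, the baseline I'm comparing against is per-client capped exponential backoff with full jitter. A minimal Python sketch; `call_api` and the parameter values are placeholders, not any particular library's API:

```python
import random
import time

def call_with_backoff(call_api, max_attempts=5, base=0.1, cap=10.0):
    """Retry call_api with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return call_api()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the failure
            # Full jitter: sleep a random amount up to the capped exponential,
            # so a fleet of clients spreads its retries out instead of
            # retrying in lockstep.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Even with jitter, every client is still deciding to retry on its own, which is the part I'm trying to move past.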
What I'm curious about:
* What's your current solution? (Circuit breakers, queues, custom coordination, a service mesh, something else? A rough circuit-breaker sketch follows these questions.)
* How well does it work? What are the gaps?
* What scale are you at? (company size, number of instances, requests/sec)
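For reference, when I say circuit breakers I mean roughly this pattern. A hand-rolled, single-process Python sketch; the class and parameter names are illustrative, not a real library's API:

```python
import time

class CircuitOpen(Exception):
    """Raised when the breaker is open and we fail fast instead of calling."""

class CircuitBreaker:
    # Minimal, not thread-safe: after `threshold` consecutive failures the
    # breaker opens and fails fast for `reset_after` seconds, then lets one
    # trial call through (half-open) before closing again.
    def __init__(self, threshold=5, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise CircuitOpen("failing fast")  # don't hammer a struggling API
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # open (or re-open) the breaker
            raise
        self.failures = 0  # success closes the breaker
        return result
```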
I'd love to hear what's working, what isn't, and what you wish existed.