How do you handle webhook delivery reliability in production in your apps?

1 point, by Tanjim, 8 months ago
Hey everyone,

I've been thinking a lot about webhook delivery reliability lately. In many projects I've worked on, building robust webhook infrastructure turned out to be deceptively complex:

- Retry logic (exponential backoff, timeouts)
- Handling non-2xx responses
- Delivery monitoring and alerting
- Back-pressure or queueing to avoid overwhelming receivers
- Secure signing and validation flows

In one project, a failed webhook delayed payment processing for hours because the retry logic was buggy. Another time, burst traffic took down the receiver endpoint, and there was no DLQ (dead-letter queue) strategy in place.

I've been researching the different approaches teams here use:

- Do you build your own custom webhook delivery queue and monitoring system?
- Do you use cloud services like AWS EventBridge or Step Functions for orchestration?
- Or do you integrate third-party tools that handle delivery, retries, and observability for you?

I'm curious how you ensure production-grade reliability at scale without burning dev hours on plumbing. Recently I've been working on a tool in this space to handle these issues automatically, but I would love to hear:

- What architecture have you found most reliable?
- What edge cases have you encountered (e.g. signature mismatches, downstream outages)?
- Any horror stories or lessons learned from webhook failures in production?

Looking forward to learning from your experiences and best practices around webhook infrastructure!
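To give the discussion something concrete to react to, here is a rough Python sketch of three of the items listed above: capped exponential backoff with jitter, a retry-or-dead-letter decision based on the response status, and HMAC-SHA256 signing with a timestamp check. It is only an illustration under my own assumptions, not how any particular tool or team does it. All function names, the `timestamp.body` signing format, the 5-attempt limit, and the 300-second cap and tolerance are hypothetical choices.

```python
import hashlib
import hmac
import random
import time


def backoff_delay(attempt, base=1.0, cap=300.0, rng=None):
    """Seconds to wait after failed attempt number `attempt` (0-based).

    Exponential growth capped at `cap`, with "full jitter": a random
    value between 0 and the ceiling, so retries do not arrive in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return (rng or random).uniform(0, ceiling)


def should_retry(status_code):
    """None means timeout/network error. Retry those, plus 408, 429, and 5xx.

    Other non-2xx responses are treated as permanent failures (an assumption).
    """
    if status_code is None:
        return True
    return status_code in (408, 429) or 500 <= status_code < 600


def deliver(send, max_attempts=5, sleep=time.sleep):
    """Call send() until it returns a 2xx status.

    Returns ("delivered", attempts) or ("dead_letter", attempts); the caller
    is expected to park dead-lettered events somewhere for inspection/replay.
    """
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        try:
            status = send()
        except OSError:  # connection errors and socket timeouts
            status = None
        if status is not None and 200 <= status < 300:
            return "delivered", attempts
        if not should_retry(status) or attempts == max_attempts:
            break
        sleep(backoff_delay(attempts - 1))
    return "dead_letter", attempts


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """Hex HMAC-SHA256 over "<timestamp>.<raw body>" (hypothetical format)."""
    message = str(timestamp).encode() + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret, timestamp, body, signature, now, tolerance=300):
    """Reject stale timestamps, then compare signatures in constant time."""
    if abs(now - timestamp) > tolerance:
        return False
    return hmac.compare_digest(sign(secret, timestamp, body), signature)
```

One edge case this makes visible: `verify` has to run on the raw request bytes. If the receiver parses the JSON and re-serializes it before checking, key order or whitespace can change and produce exactly the kind of signature mismatch mentioned above.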