LLM agent workflows fail silently. Here's the reliability layer we wish existed
1 point • by AJ1313k • 7 months ago
For the past few months, my co-founder and I have been building complex agentic workflows, and we kept hitting the same reliability issues: inconsistent shared state, silent failures, agents diverging from each other, and no clean way to recover without restarting the entire workflow.

It became clear that most failures weren’t “LLM problems” but classic distributed-systems problems showing up in multi-agent setups.

Since nothing in the current ecosystem addressed this properly, we started building a reliability layer for agent workflows: something that adds structure, safety, and predictable recovery to multi-agent systems without forcing developers to rewrite their stack.

We’re looking to connect with people who have run into similar issues or are building production-grade agent workflows. The goal is to understand how others think about reliability, failure recovery, and workflow consistency in these systems.

If you’re working in this space or want to try the early access, here’s the link: https://tally.so/r/LZDb0j

We would appreciate any thoughts or experiences others have had around agent reliability, especially failure cases or pain points.