Ask HN: How are you scaling AI agents reliably in production?

1 point by nivedit-jain 9 months ago
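I'm looking to learn from people running agents beyond demos. If you have a production setup, would you share what works and what broke?

What I'm most curious about:

- Orchestrator choice and why: LangGraph, Temporal, Airflow, Prefect, custom queues.
- State and checkpointing: where do you persist steps, how do you replay, and how do you handle schema changes?
- Concurrency control: parallel tool calls, backpressure, timeouts, idempotency for retries.
- Autoscaling and cost: policies that kept latency and spend sane, spot vs on-demand, GPU sharing.
- Memory and retrieval: vector DB vs KV store, eviction policies, preventing stale context.
- Observability: tracing, metrics, evals that actually predicted incidents.
- Safety and isolation: sandboxing tools, rate limits, abuse filters, PII handling.
- A war story: the incident that taught you a lesson and the fix.

Context (so it's not a drive-by): small team, Python, k8s, MongoDB for state, Redis for queues, everything custom, experimenting with LangGraph and Temporal. Happy to share configs and trade notes in the comments.

Answer any subset. Even a quick sketch of your stack and one gotcha would help others reading this. Thanks!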