I’m looking to learn from people running agents beyond demos. If you have a production setup, would you share what works and what broke?

What I’m most curious about:

- Orchestrator choice and why: LangGraph, Temporal, Airflow, Prefect, custom queues.
- State and checkpointing: where do you persist steps, how do you replay, how do you handle schema changes.
- Concurrency control: parallel tool calls, backpressure, timeouts, idempotency for retries.
- Autoscaling and cost: policies that kept latency and spend sane, spot vs on-demand, GPU sharing.
- Memory and retrieval: vector DB vs KV store, eviction policies, preventing stale context.
- Observability: tracing, metrics, evals that actually predicted incidents.
- Safety and isolation: sandboxing tools, rate limits, abuse filters, PII handling.
- A war story: the incident that taught you a lesson and the fix.

Context (so it’s not a drive-by): small team, Python, k8s, MongoDB for state, Redis for queues, everything custom, experimenting with LangGraph and Temporal. Happy to share configs and trade notes in the comments.

Answer any subset. Even a quick sketch of your stack and one gotcha would help others reading this. Thanks!
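
To make the concurrency-control question concrete, a minimal sketch of idempotent retries with a timeout and bounded concurrency might look like the following. It is illustrative only, not anyone's production code: `call_tool`, `idempotency_key`, and the in-memory `_results` dict are hypothetical names, and a real setup would presumably keep results in a durable store and also pass the key to the downstream tool so that side effects are deduplicated there.

```python
import asyncio
import hashlib
import json

# Hypothetical in-memory result store (assumption); swap in a durable store in practice.
_results: dict[str, object] = {}


def idempotency_key(run_id: str, step: str, args: dict) -> str:
    """Derive a stable key so a retried or replayed step maps to the same result."""
    payload = json.dumps({"run": run_id, "step": step, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


async def call_tool(tool, run_id, step, args, *, sem, timeout=5.0, retries=3):
    """Run one tool call with bounded concurrency, a per-attempt timeout, and retries."""
    key = idempotency_key(run_id, step, args)
    if key in _results:
        # Replay or duplicate delivery: reuse the stored result, skip the tool.
        return _results[key]
    async with sem:  # the semaphore caps in-flight calls (a simple form of backpressure)
        for attempt in range(1, retries + 1):
            try:
                result = await asyncio.wait_for(tool(**args), timeout)
                _results[key] = result
                return result
            except (asyncio.TimeoutError, ConnectionError):
                if attempt == retries:
                    raise  # out of attempts: surface the failure to the caller
                await asyncio.sleep(0.01 * 2 ** attempt)  # exponential backoff
```

Note that the lookup happens before the semaphore is acquired, so two identical calls issued at the same moment can both reach the tool; this sketch only deduplicates steps that have already completed.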

Ask HN: How do you reliably scale AI agents in production?