


I tried to evaluate an AI agent using a benchmark-style approach. It failed in ways I didn’t expect.

Instead of model quality issues, most failures came from system-level problems. A few examples from a small test suite:

- Broken URLs in tool calls → score dropped to 22
- Agent calling localhost in a cloud environment → got stuck at 46
- Real CVEs flagged as hallucinations → evaluation issue, not model issue
- Reddit blocking requests → external dependency failure
- Missing API key in production → silent failure

Each run surfaced a real bug, but not the kind I was originally trying to measure.

What surprised me is that evaluating agents isn’t just about scoring outputs. It’s about validating the entire system: tools, environment, data access, and how the agent interacts with all of it. In other words, most of the failure modes looked more like software bugs than LLM mistakes.

This made me think that evaluation loops for agents should look more like software testing than benchmarking:

- repeatable test suites
- clear pass/fail criteria
- regression detection
- root cause analysis

Otherwise it’s very easy to misattribute failures to the model when they’re actually coming from somewhere else.

I ended up building a small tool to structure this process, but the bigger takeaway for me is how messy real-world agent evaluation actually is compared to standard benchmarks.

Curious how others are approaching this, especially in production settings.
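
To make the "software testing, not benchmarking" idea concrete, here is a rough Python sketch of what such a loop could look like. It is not the tool mentioned above, and every name in it (`preflight`, `run_suite`, `regressions`, `SEARCH_API_KEY`, the example URLs) is a hypothetical placeholder. It assumes each test case is a plain callable returning pass/fail, and that system-level checks run before any output is scored, so a missing key or a localhost URL is attributed to the environment rather than to the model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Mapping
from urllib.parse import urlparse


class Cause(Enum):
    """Hypothetical root-cause buckets for a failed check."""
    ENVIRONMENT = "environment"  # e.g. missing API key, localhost in the cloud
    TOOL = "tool"                # e.g. malformed URL in a tool call


@dataclass
class Problem:
    cause: Cause
    detail: str


def preflight(tool_urls: List[str], required_env: List[str],
              env: Mapping[str, str], in_cloud: bool) -> List[Problem]:
    """Find system-level problems before the agent's output is scored."""
    problems = []
    for key in required_env:
        if not env.get(key):
            problems.append(Problem(Cause.ENVIRONMENT, f"missing env var {key}"))
    for url in tool_urls:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.hostname:
            problems.append(Problem(Cause.TOOL, f"malformed URL {url!r}"))
        elif in_cloud and parsed.hostname in ("localhost", "127.0.0.1"):
            problems.append(Problem(Cause.ENVIRONMENT, f"localhost URL {url!r}"))
    return problems


def run_suite(cases: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every case; an exception counts as a failure, not a crash."""
    results = {}
    for name, check in cases.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def regressions(baseline: Dict[str, bool], current: Dict[str, bool]) -> List[str]:
    """Names of cases that passed in the baseline run but fail now."""
    return sorted(name for name, ok in baseline.items()
                  if ok and not current.get(name, False))


if __name__ == "__main__":
    # Placeholder inputs; a real suite would load these from config.
    for p in preflight(["http://localhost:8000/search"], ["SEARCH_API_KEY"],
                       env={}, in_cloud=True):
        print(p.cause.value, "-", p.detail)
    baseline = {"finds_cve": True, "cites_source": True}
    current = run_suite({"finds_cve": lambda: True, "cites_source": lambda: False})
    print("regressions:", regressions(baseline, current))
```

The point of the split is attribution: if `preflight` reports anything, the run is a system failure and the score says nothing about the model. Only a clean preflight followed by a failing case is worth investigating as a possible model issue.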

Problems I ran into when evaluating AI agents in production