Are we evaluating AI agents the wrong way?

Author: imshashank, 22 days ago
I've been building AI agents for the past year, and I've noticed something troubling: everyone I talk to evaluates their agents the same way, by looking at the final output and asking "Is it correct?"

But that's completely wrong.

An agent can get the right answer through the wrong path. It can hallucinate in intermediate steps but still reach the correct conclusion. It can violate constraints while technically achieving the goal.

Traditional ML metrics (accuracy, precision, recall) miss all of this because they only look at the final output.

I've been experimenting with a different approach: using the agent's system prompt as ground truth, evaluating the entire trajectory (not just the final output), and using multi-dimensional scoring (not just a single metric).

The results are night and day. Suddenly I can see hallucinations, constraint violations, inefficient paths, and consistency issues that traditional metrics completely missed.

Am I crazy? Or is the entire industry evaluating agents wrong?

I'd love to hear from others who are building agents. How are you evaluating them? What problems have you run into?
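To make "evaluate the entire trajectory" concrete, here is a minimal sketch of one way such a scorer could look in Python. Everything in it is illustrative: the Step and Trajectory shapes, the dimension names, and the judge callable (e.g. an LLM-as-judge wrapper) are assumptions for the sketch, not the post's actual harness or any specific framework.

    from dataclasses import dataclass

    @dataclass
    class Step:
        thought: str       # the agent's reasoning at this step
        action: str        # the tool call or message it emitted
        observation: str   # what the environment/tool returned

    @dataclass
    class Trajectory:
        system_prompt: str   # treated as ground truth for constraints
        steps: list[Step]
        final_output: str

    def evaluate(traj: Trajectory, judge) -> dict[str, float]:
        """Score the whole trajectory on several dimensions.

        `judge` is any callable that takes a yes/no question as a string
        and returns a score in [0, 1] (e.g. an LLM-as-judge wrapper).
        """
        return {
            # Were intermediate claims grounded in what the agent saw?
            "groundedness": min(
                (judge(f"Is this claim supported by the observation?\n"
                       f"Claim: {s.thought}\nObservation: {s.observation}")
                 for s in traj.steps),
                default=1.0,
            ),
            # Did every step respect the system prompt's constraints?
            "constraint_adherence": min(
                (judge(f"Is this action consistent with these instructions?\n"
                       f"Instructions: {traj.system_prompt}\nAction: {s.action}")
                 for s in traj.steps),
                default=1.0,
            ),
            # Crude proxy for path efficiency: fewer steps scores higher.
            "efficiency": 1.0 / max(len(traj.steps), 1),
            # The traditional check, kept as one dimension among several.
            "final_correctness": judge(
                f"Is this final answer correct?\nAnswer: {traj.final_output}"
            ),
        }

The point of returning a dict instead of a single number is exactly the post's argument: a trajectory can score 1.0 on final_correctness and still fail on groundedness or constraint_adherence, which is the failure mode a single output-only metric hides.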