


Disclaimer: I am not an ML researcher, so the terms are informal/wonky. Apologies!

I’m doing a small experiment to see whether models “know when they know” on T20 international cricket scorecards (cricsheet.com for source). The idea is to test models on publicly available data they likely saw during training, and see if they hallucinate or admit they don't know.

Setup: Each question is from a single T20 match. The model must return an answer (numeric or a choice from the given options) or `no_answer`.

Results (N=100 per model):

- gpt-4o-search-preview • Answer rate: 0.96 • Accuracy: 0.88 • Accuracy (answered): 0.91 • Hallucination (answered): 0.09 • Wrong/100: 9
- gpt-5 • Answer rate: 0.35 • Accuracy: 0.27 • Accuracy (answered): 0.77 • Hallucination (answered): 0.23 • Wrong/100: 8
- gpt-4o-mini • Answer rate: 0.37 • Accuracy: 0.14 • Accuracy (answered): 0.38 • Hallucination (answered): 0.62 • Wrong/100: 23
- gpt-5-mini • Answer rate: 0.05 • Accuracy: 0.02 • Accuracy (answered): 0.40 • Hallucination (answered): 0.60 • Wrong/100: 3

Note: most remaining “errors” with search are obscure/disputed cases where public sources disagree.

It seems that for domains where models might have seen some data, it’s better to rely on abstention + RAG than on a larger model with more coverage but a worse hallucination rate.

Code/Data: https://github.com/jobswithgpt/llmcriceval
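For readers who want to see how the five reported numbers relate to each other, here is a minimal Python sketch. It is not taken from the linked repo. The metric definitions are my assumption, inferred from the figures above: accuracy is measured over all questions, while "Accuracy (answered)" and "Hallucination (answered)" are measured only over questions where the model did not return `no_answer`. The names `score` and `Metrics` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence

NO_ANSWER = "no_answer"  # abstention token described in the post's setup


@dataclass
class Metrics:
    answer_rate: float             # answered / total
    accuracy: float                # correct / total
    accuracy_answered: float       # correct / answered
    hallucination_answered: float  # wrong / answered
    wrong_per_100: float           # wrong answers per 100 questions


def score(predictions: Sequence[str], gold: Sequence[str]) -> Metrics:
    """Score predictions against gold answers (assumed metric definitions)."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal length")

    total = len(gold)
    # Keep only the questions the model chose to answer.
    answered = [(p, g) for p, g in zip(predictions, gold) if p != NO_ANSWER]
    n_answered = len(answered)
    correct = sum(1 for p, g in answered if p == g)
    wrong = n_answered - correct  # answered but incorrect: counted as a hallucination

    return Metrics(
        answer_rate=n_answered / total,
        accuracy=correct / total,
        accuracy_answered=correct / n_answered if n_answered else 0.0,
        hallucination_answered=wrong / n_answered if n_answered else 0.0,
        wrong_per_100=100 * wrong / total,
    )


# Synthetic labels shaped like the gpt-5 row: 27 correct, 8 wrong, 65 abstentions.
preds = ["a"] * 27 + ["b"] * 8 + [NO_ANSWER] * 65
gold = ["a"] * 100
m = score(preds, gold)
print(m)
```

With those synthetic counts, the sketch gives an answer rate of 0.35, accuracy of 0.27, accuracy (answered) of about 0.77, hallucination (answered) of about 0.23, and 8 wrong per 100. That matches the gpt-5 row and suggests the assumed definitions are at least consistent with it.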

LLMs (GPT-5) hallucinate frequently in dense statistical domains (cricket)