My AI didn't misread the receipt. It made one up.

Author: Raywob, 4 days ago
I pointed a vision model at a grocery receipt. It returned a store name, item list, and total. None of it was on the paper.

This wasn't OCR error. The model didn't confuse a "7" for a "1." It generated a plausible-looking receipt from scratch: different store, different items, different prices. If I hadn't been holding the original, I might not have caught it.

Same image, different model (same parameter count, same hardware), five seconds later: every item correct, store name right, total accurate to the penny.

The models: minicpm-v 8B (fabricated) vs qwen3-vl 8B (accurate). Both open source, both ~6GB VRAM, both running locally via Ollama on an RTX 5080.

What I learned:

1. Vision model hallucination is qualitatively different from text hallucination. A text model gives you a wrong answer to a real question. A vision model gives you a confident answer to an image it didn't process. The second is harder to detect.

2. Model selection matters more than prompt engineering for vision. Same prompt, same image; one model fabricated, one read accurately. No prompt optimization fixes a model that invents data.

3. Confidence scoring is mandatory. I added a reconciliation check: do the extracted items sum to roughly the stated total? This catches fabrication that looks plausible at the individual line-item level.

4. The fix wasn't more money or a bigger model. Same size (8B), same hardware, same cost ($0). Just a different architecture that actually reads pixels instead of generating plausible text about them.

Full writeup with the pipeline architecture and code patterns: https://dev.to/rayne_robinson_e479bf0f26/my-ai-read-a-receipt-wrong-it-didnt-misread-it-it-made-one-up-4f5n
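The reconciliation check from point 3 is simple to implement. Here is a minimal sketch of one way to do it; the function name, tolerance value, and sample data are my own illustrations, not code from the writeup, and a real pipeline would need to tune the tolerance for things like unitemized tax or discounts:

```python
from decimal import Decimal

def reconcile(items, stated_total, tolerance=Decimal("0.05")):
    """Flag extractions whose line items don't sum to the stated total.

    items: list of (name, price_string) pairs as extracted by the model.
    Returns (computed_total, ok); ok is False when the mismatch exceeds
    the tolerance, which signals likely fabrication.
    """
    computed = sum(Decimal(price) for _, price in items)
    ok = abs(computed - Decimal(stated_total)) <= tolerance
    return computed, ok

# A fabricated receipt tends to fail this check: individually plausible
# line items rarely add up to an (also invented) total.
items = [("Milk", "3.49"), ("Bread", "2.99"), ("Eggs", "4.29")]
print(reconcile(items, "10.77"))  # items sum to 10.77: consistent
print(reconcile(items, "23.50"))  # mismatch: flag for review
```

Using `Decimal` rather than floats avoids binary rounding noise on currency values, which matters when the pass/fail threshold is a few cents.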