My AI didn't misread the receipt. It made one up.

Author: Raywob, 4 days ago
I pointed a vision model at a grocery receipt. It returned a store name, item list, and total. None of it was on the paper.

This wasn't OCR error. The model didn't confuse a "7" for a "1." It generated a plausible-looking receipt from scratch: different store, different items, different prices. If I hadn't been holding the original, I might not have caught it.

Same image, different model (same parameter count, same hardware), five seconds later: every item correct, store name right, total accurate to the penny.

The models: minicpm-v 8B (fabricated) vs qwen3-vl 8B (accurate). Both open source, both ~6GB VRAM, both running locally via Ollama on an RTX 5080.

What I learned:

1. Vision model hallucination is qualitatively different from text hallucination. A text model gives you a wrong answer to a real question. A vision model gives you a confident answer to an image it didn't process. The second is harder to detect.

2. Model selection matters more than prompt engineering for vision. Same prompt, same image; one model fabricated, one read accurately. No prompt optimization fixes a model that invents data.

3. Confidence scoring is mandatory. I added a reconciliation check: do the extracted items sum to roughly the stated total? This catches fabrication that looks plausible at the individual line-item level.

4. The fix wasn't more money or a bigger model. Same size (8B), same hardware, same cost ($0). Just a different architecture that actually reads pixels instead of generating plausible text about them.

Full writeup with the pipeline architecture and code patterns: https://dev.to/rayne_robinson_e479bf0f26/my-ai-read-a-receipt-wrong-it-didnt-misread-it-it-made-one-up-4f5n
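The reconciliation check from point 3 is simple to implement. Here is a minimal sketch of one way to do it; the function name, tolerance value, and sample data are my own illustrations, not code from the writeup, and a real pipeline would need to tune the tolerance for things like unitemized tax or discounts:

```python
from decimal import Decimal

def reconcile(items, stated_total, tolerance=Decimal("0.05")):
    """Flag extractions whose line items don't sum to the stated total.

    items: list of (name, price_string) pairs as extracted by the model.
    Returns (computed_total, ok); ok is False when the mismatch exceeds
    the tolerance, which signals likely fabrication.
    """
    computed = sum(Decimal(price) for _, price in items)
    ok = abs(computed - Decimal(stated_total)) <= tolerance
    return computed, ok

# A fabricated receipt tends to fail this check: individually plausible
# line items rarely add up to an (also invented) total.
items = [("Milk", "3.49"), ("Bread", "2.99"), ("Eggs", "4.29")]
print(reconcile(items, "10.77"))  # items sum to 10.77: consistent
print(reconcile(items, "23.50"))  # mismatch: flag for review
```

Using `Decimal` rather than floats avoids binary rounding noise on currency values, which matters when the pass/fail threshold is a few cents.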