


Hi HN,

Six months ago, I asked Gemini to "send my weekly report to the team." It replied "Email sent successfully", but the email was never sent. The attachment was wrong. Nobody told me.

That's when I realized: *LLMs lie about their own execution.*

---

*The Problem:*

When you ask an LLM to automate multi-step tasks (search file → attach → send), it cheerfully reports success even when:

- The file doesn't exist (it hallucinates the ID)
- The API call failed silently
- Permissions were denied

Single-LLM systems have no incentive to admit failure; they optimize for appearing helpful, not for being correct.

---

*My Solution: Don't Let the LLM Grade Its Own Homework*

I built PupiBot with three separate agents that cannot collude, ensuring *the agent that executed the step is NOT the one verifying it succeeded.*

The architecture is simple:

* *CEO Agent (Planner, Gemini Flash):* Generates the execution plan (no API access).
* *COO Agent (Executor, Gemini Pro):* Executes steps, calls 81 Google APIs, returns raw API responses.
* *QA Agent (Verifier, Gemini Flash):* *After EVERY critical step, validates success with real, independent API calls.* Triggers a retry if verification fails.

*Real Example (Caught & Fixed):*

User: "Email last month's sales report to Alice"

* Search Drive: Not found | *QA Agent:* "Step failed. Retry with fuzzy search."
* Finds: "Q3_Sales_Final_v2.pdf" | *QA Agent:* "File verified. Proceed."
* Sends email | *QA Agent:* "Email delivered. Attachment confirmed."

It's like code review: you don't approve your own PRs.

---

*Current Implementation & Transparency:*

* *Open Source*: MIT License, Python 3.10+
* *APIs*: Google Workspace (Gmail, Drive, Contacts, Calendar, Docs).
* *Reliability (Self-Tested):* Baseline (single Gemini Pro) was ~70% success. PupiBot (triple-agent) achieves *~92% success* on the same tasks.
* *Known Limitations*: Google-only, 3x LLM overhead (tradeoff: reliability > speed), early stage.

---

*Why I'm Sharing This (My Garage Story):*

I'm not a programmer, and I have no formal CS degree. My development process was simple: I'd use PupiBot as my daily assistant, manually log every error, and bring that "bug report" to my AI assistants (Claude, Gemini) to fix.

PupiBot is my "custom car" built in the garage, fueled by passion and persistence. I'm finally opening the door to invite the real mechanics (you, HN) to examine the engine.

*What I'd Love from HN:*

1. *Feedback* on the independent QA agent pattern.
2. *Benchmarking ideas* for rigorous evaluation.
3. *Architectural critiques.* Where's the weak link?

---

*Links:*

- GitHub: https://github.com/PupiBott/PupiBot1.0
- Quick Demo (1:44 min): https://youtube.com/shorts/wykKckwaukY?si=0xdn7rM6B2tMAIPw
- Architecture Docs: https://github.com/PupiBott/PupiBot1.0/blob/main/ARCHITECTURE.md

Built by a self-taught technology enthusiast in Chile. Special thanks to Claude Sonnet 4.5 for being my coding partner throughout this journey.
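
For readers who want the execute-then-verify loop from the post in concrete form, here is a minimal sketch. It is not PupiBot's code: `Step`, `StepResult`, `run_plan`, and `demo` are hypothetical names, Drive and Gmail are replaced by in-memory lists, and the verifier is a plain function rather than a separate LLM agent making its own API calls. It only illustrates the idea that the executor's claim is never trusted and that a failed check triggers a retry.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    ok: bool      # what the executor claims
    detail: str


@dataclass
class Step:
    name: str
    execute: Callable[[int], StepResult]  # executor action; receives the attempt number
    verify: Callable[[], bool]            # independent check; ignores the executor's claim


def run_plan(steps: list[Step], max_retries: int = 2) -> list[str]:
    """Run each step, then let a separate check decide whether it really succeeded."""
    log: list[str] = []
    for step in steps:
        for attempt in range(max_retries + 1):
            claimed = step.execute(attempt)
            verified = step.verify()
            log.append(
                f"{step.name} attempt={attempt} claimed={claimed.ok} verified={verified}"
            )
            if verified:
                break
        else:
            # No attempt passed verification: stop, since later steps depend on this one.
            log.append(f"{step.name} FAILED after {max_retries + 1} attempts")
            return log
    return log


def demo() -> tuple[list[str], list[dict]]:
    drive = ["Q3_Sales_Final_v2.pdf", "notes.txt"]  # pretend Drive contents
    outbox: list[dict] = []                          # pretend sent mail
    found: dict[str, str] = {}

    def search_file(attempt: int) -> StepResult:
        query = "sales report"
        if attempt == 0:
            matches = [f for f in drive if f == query]  # exact match misses
        else:
            # Crude fuzzy retry: any query word appears in the file name.
            matches = [f for f in drive if any(w in f.lower() for w in query.split())]
        # Simulate the failure mode from the post: a made-up ID reported as success.
        found["file"] = matches[0] if matches else "hallucinated-id"
        return StepResult(True, found["file"])

    def send_email(attempt: int) -> StepResult:
        outbox.append({"to": "alice@example.com", "attachment": found["file"]})
        return StepResult(True, "sent")

    steps = [
        Step("search_file", search_file, lambda: found.get("file") in drive),
        Step(
            "send_email",
            send_email,
            lambda: any(
                m["to"] == "alice@example.com" and m["attachment"] in drive
                for m in outbox
            ),
        ),
    ]
    return run_plan(steps), outbox


if __name__ == "__main__":
    for line in demo()[0]:
        print(line)
```

In this toy run the first search "succeeds" according to the executor but fails verification, so it is retried with the fuzzy match before the email step runs. A real verifier would have to query the actual service (for example, re-reading the sent message) rather than shared in-process state.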

Show HN: I built a triple-agent LLM system that verifies its own work