Show HN: I built a three-agent LLM system that verifies its own work

1 point by pupibott 8 months ago
Hi HN,

Six months ago, I asked Gemini to "send my weekly report to the team." It replied "Email sent successfully," but the email was never sent. The attachment was wrong. Nobody told me.

That's when I realized: *LLMs lie about their own execution.*

---

*The Problem:*

When you ask an LLM to automate multi-step tasks (search file → attach → send), it cheerfully reports success even when:

- The file doesn't exist (it hallucinates the ID)
- The API call failed silently
- Permissions were denied

Single-LLM systems have no incentive to admit failure; they optimize for appearing helpful, not for being correct.

---

*My Solution: Don't Let the LLM Grade Its Own Homework*

I built PupiBot with three separate agents that cannot collude, ensuring *the agent that executed the step is NOT the one verifying it succeeded.*

The architecture is simple:

* *CEO Agent (Planner, Gemini Flash):* Generates the execution plan (no API access).
* *COO Agent (Executor, Gemini Pro):* Executes steps, calls 81 Google APIs, returns raw API responses.
* *QA Agent (Verifier, Gemini Flash):* *After EVERY critical step, validates success with real, independent API calls.* Triggers a retry if verification fails.

*Real Example (Caught & Fixed):*

User: "Email last month's sales report to Alice"

* Search Drive: Not found
* *QA Agent:* "Step failed. Retrying with fuzzy search."
* Finds: "Q3_Sales_Final_v2.pdf" | *QA Agent:* "File verified. Proceed."
* Sends email | *QA Agent:* "Email delivered. Attachment confirmed."

It's like code review: you don't approve your own PRs.

---

*Current Implementation & Transparency:*

* *Open Source*: MIT License, Python 3.10+
* *APIs*: Google Workspace (Gmail, Drive, Contacts, Calendar, Docs).
* *Reliability (Self-Tested):* Baseline (single Gemini Pro) was ~70% success. PupiBot (triple-agent) achieves *~92% success* on the same tasks.
* *Known Limitations*: Google-only, 3x LLM overhead (tradeoff: reliability > speed), early stage.

---

*Why I'm Sharing This (My Garage Story):*

I'm not a programmer and I have no formal CS degree. My development process was simple: I'd use PupiBot as my daily assistant, manually log every error, and bring that "bug report" to my AI assistants (Claude, Gemini) to fix.

PupiBot is my "custom car" built in the garage, fueled by passion and persistence. I'm finally opening the door to invite the real mechanics (you, HN) to examine the engine.

*What I'd Love from HN:*

1. *Feedback* on the independent QA agent pattern.
2. *Benchmarking ideas* for rigorous evaluation.
3. *Architectural critiques.* Where's the weak link?

---

*Links:*

- GitHub: https://github.com/PupiBott/PupiBot1.0
- Quick Demo (1:44 min): https://youtube.com/shorts/wykKckwaukY?si=0xdn7rM6B2tMAIPw
- Architecture Docs: https://github.com/PupiBott/PupiBot1.0/blob/main/ARCHITECTURE.md

Built by a self-taught technology enthusiast in Chile.

Special thanks to Claude Sonnet 4.5 for being my coding partner throughout this journey.
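
For readers who want to see the executor/verifier split in code, here is a minimal sketch of the idea. It is not taken from the PupiBot repository: `FakeDrive`, `execute_find`, `verify_find`, and `run_step` are hypothetical names, an in-memory dict stands in for Google Drive, and no LLM is called. The executor is written to report a made-up ID on a miss, imitating the failure mode described in the post, so the only thing that catches it is a separate check against the store itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class FakeDrive:
    """In-memory stand-in for a file store (assumption: not the Drive API)."""
    files: Dict[str, str] = field(default_factory=dict)  # file_id -> file name

    def search_exact(self, name: str) -> List[str]:
        return [fid for fid, n in self.files.items() if n == name]

    def search_fuzzy(self, keyword: str) -> List[str]:
        return [fid for fid, n in self.files.items() if keyword.lower() in n.lower()]

    def exists(self, file_id: str) -> bool:
        return file_id in self.files


def execute_find(drive: FakeDrive, query: str, attempt: int) -> str:
    """Executor: returns the file ID it claims matches the query.

    On a miss it returns a made-up ID, simulating an unverified 'success'.
    """
    if attempt == 0:
        hits = drive.search_exact(query)
    else:
        # Retry strategy: fuzzy match on the first word of the query.
        hits = drive.search_fuzzy(query.split()[0])
    return hits[0] if hits else "hallucinated-id"


def verify_find(drive: FakeDrive, file_id: str) -> Tuple[bool, str]:
    """Verifier: checks the claim against the store, not the executor's report."""
    if drive.exists(file_id):
        return True, f"verified {drive.files[file_id]}"
    return False, f"no such file: {file_id}"


def run_step(
    drive: FakeDrive, query: str, max_attempts: int = 2
) -> Tuple[Optional[str], List[str]]:
    """Runs execute -> verify, retrying until verified or attempts run out."""
    log: List[str] = []
    for attempt in range(max_attempts):
        claimed = execute_find(drive, query, attempt)
        ok, detail = verify_find(drive, claimed)
        log.append(detail)
        if ok:
            return claimed, log
    return None, log


if __name__ == "__main__":
    drive = FakeDrive({"f1": "Q3_Sales_Final_v2.pdf"})
    file_id, log = run_step(drive, "sales report")
    print(file_id, log)
```

In this sketch the first attempt fails verification and the fuzzy retry passes, which mirrors the "Not found → retry with fuzzy search → File verified" trace in the example above. If every attempt fails, `run_step` returns `None` instead of a claimed success, so the caller has to handle the failure explicitly.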