Ask HN: What does your "agentic software dark factory" look like?

Posted by ElFitz, about 2 hours ago
In some of the comment threads around here, a few of you shared interesting ideas and patterns, enough to make me believe that everyone interested in harness engineering is working on some sort of software dark factory or another.

We have OpenAI's Symphony[1], StrongDM's Factory[2], Yegge's GasTown[3], and probably a few others I've missed.

So I'm curious. What have you been working on? What have you learned? What has worked and what has failed? And what do you think comes after?

I'll go first. The first thing I tried that yielded interesting results was, when possible, giving the model a ground truth or reference to iterate against: screenshots or mockups for UI work, API contracts and unit / integration tests for logic. That's the Ralph Loop we all know and love. A feedback loop.

The second (obvious, I know) was splitting planning and implementation.

Reviews by other models, with iterative loops, came next, with appreciable results. However, the implementing agent would often wiggle out by deferring things into oblivion, or by declaring genuinely important feedback out of scope. Another feedback loop. I've found that turning those reviews into "hard gates" has its own set of issues: reviewing agents will always find something to nitpick, which turns this iterative implementation approach into a near-infinite loop.

Combining these reviews with committing the plans alongside the code led to an interesting accident: reviewing agents spontaneously and unexpectedly picked up on the plans and drastically improved their feedback by comparing plan and implementation (it should have been obvious in hindsight, and you can imagine my surprise the first time GitHub Copilot actually gave useful feedback instead of the usual typo nitpicks).

Then a comment here led me to an adversarial green team / red team process.

A first agent creates a spec (based on StrongDM's NLSpec) from my initial plan, including a detailed API, and gets it reviewed.

A red team agent writes unit and integration tests from those specs, and gets them reviewed.

Then a green team agent is given the same specs and API and implements the actual feature or fix, iterating against the tests without any access to the tests themselves: it only sees which tests failed and what they were testing. This prevents it from gaming the tests (rough sketch of that loop below).

Finally, once the tests pass, a reviewing agent reviews the implementation against the specs.

This was nice. It allows mixing and matching models, thinking levels, and providers. But both the green and red teams would sometimes diverge from the initial specs and API, sometimes with good reason.

So another agent was brought in to evaluate those divergences when they occur and, if they are valid improvements, restart the process from the spec generation step with the new insights. Yet another feedback loop.

And finally, I integrated logs, OTel traces, and stack traces into the process. These agents seem remarkably good at sifting through them, and end-to-end observability drastically improved results. Again, a feedback loop.
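To make the green-team constraint concrete, here is a stripped-down sketch of that inner loop in Python, with pytest as the runner. It is not my literal harness: call_agent is a placeholder for whatever model or CLI you drive, descriptions.json is assumed to be written by the red-team stage (one short description per test, e.g. its docstring), and keeping tests/ out of the implementer's sandbox is left to the harness itself.

    import json
    import subprocess
    import xml.etree.ElementTree as ET
    from pathlib import Path

    def call_agent(prompt: str) -> None:
        """Placeholder: hand the prompt to the implementing agent and let it edit src/."""
        raise NotImplementedError

    def failing_tests(report: Path) -> list[str]:
        """Names of failed or errored test cases from a pytest JUnit XML report."""
        tree = ET.parse(report)
        return [
            case.attrib["name"]
            for case in tree.iter("testcase")
            if case.find("failure") is not None or case.find("error") is not None
        ]

    def green_team_loop(spec: str, max_rounds: int = 10) -> bool:
        # Written by the red-team stage: test name -> what it checks.
        descriptions = json.loads(Path("descriptions.json").read_text())
        for _ in range(max_rounds):
            # Run the hidden suite; the implementing agent never reads tests/ itself.
            subprocess.run(
                ["pytest", "tests/", "--junitxml=report.xml", "-q"],
                capture_output=True,
            )
            failed = failing_tests(Path("report.xml"))
            if not failed:
                return True  # everything green, hand off to the reviewing agent
            feedback = "\n".join(
                f"- {name}: {descriptions.get(name, 'no description')}" for name in failed
            )
            call_agent(
                f"Spec:\n{spec}\n\n"
                f"These checks are still failing; fix the implementation in src/:\n{feedback}"
            )
        return False  # give up and escalate (often a sign of a spec divergence)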
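And the outer loop around it, again just a sketch with every role stubbed out. In practice each stub is a separate model call, which is where the mixing and matching of providers and thinking levels happens; green_team_loop is the function from the previous sketch.

    def spec_agent(plan: str, insights: list[str]) -> tuple[str, str]:
        """Turn the plan (plus insights from earlier runs) into an NLSpec-style spec and a detailed API, then get it reviewed."""
        raise NotImplementedError

    def red_team_agent(spec: str, api: str) -> None:
        """Write unit/integration tests into tests/ plus descriptions.json, then get them reviewed."""
        raise NotImplementedError

    def review_agent(spec: str, api: str) -> str:
        """Review the implementation in src/ against the spec once the tests pass."""
        raise NotImplementedError

    def divergence_agent(spec: str, api: str, review: str) -> list[str]:
        """Decide which deviations from the spec or API are valid improvements worth re-speccing for."""
        raise NotImplementedError

    def run_pipeline(plan: str, max_restarts: int = 3) -> bool:
        insights: list[str] = []
        for _ in range(max_restarts):
            spec, api = spec_agent(plan, insights)
            red_team_agent(spec, api)
            if not green_team_loop(spec):      # the inner loop sketched above
                return False
            review = review_agent(spec, api)
            divergences = divergence_agent(spec, api, review)
            if not divergences:
                return True
            # Valid improvements restart everything from spec generation.
            insights.extend(divergences)
        return False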
That's all for me so far. Curious to see what everyone else has to share about this!