Launch HN: Canary (YC W26) – AI QA that understands your code

Posted by Visweshyc, 3 days ago
Hey HN! We're Aakash and Viswesh, and we're building Canary (https://www.runcanary.ai). We build AI agents that read your codebase, figure out what a pull request actually changed, and generate and execute tests for every affected user workflow.

Aakash and I previously built AI coding tools at Windsurf, Cognition, and Google. AI tools were making every team faster at shipping, but nobody was testing real user behavior before merge. PRs got bigger, reviews still happened in file diffs, and changes that looked clean broke checkout, auth, and billing in production. We saw it firsthand. We started Canary to close that gap. Here's how it works:

Canary starts by connecting to your codebase and learning how your app is built: routes, controllers, validation logic. You push a PR, and Canary reads the diff, understands the intent behind the changes, then generates and runs tests against your preview app, checking real user flows end to end. It comments directly on the PR with test results and recordings, showing what changed and flagging anything that doesn't behave as expected. You can also trigger specific user-workflow tests via a PR comment.

Beyond PR testing, tests generated from a PR can be promoted into regression suites. You can also create tests by simply prompting, in plain English, what you want tested. Canary generates a full test suite from your codebase, schedules it, and runs it continuously. One of our construction-tech customers had an invoicing flow where the amount due drifted from the original proposal total by ~$1,600. Canary caught the regression in their invoice flow before release.

This isn't something a single family of foundation models can do on its own. QA spans too many modalities (source code, DOM/ARIA, device emulators, visual verification, screen-recording analysis, network/console logs, live browser state, and more)
for any single model to specialize in. You also need custom browser fleets, user sessions, ephemeral environments, on-device farms, and data seeding to run tests reliably. On top of that, catching the second-order effects of a code change requires a specialized harness that tries to break the application in multiple ways, across different types of users, in ways a normal happy-path testing flow wouldn't.

To measure how well our purpose-built QA agent works, we published QA-Bench v0, the first benchmark for code verification. Given a real PR, can an AI model identify every affected user workflow and produce relevant tests? We tested our purpose-built QA agent against GPT 5.4, Claude Code (Opus 4.6), and Sonnet 4.6 across 35 real PRs on Grafana, Mattermost, Cal.com, and Apache Superset, on three dimensions: relevance, coverage, and coherence. Coverage is where the gap was largest: Canary leads GPT 5.4 by 11 points, Claude Code by 18, and Sonnet 4.6 by 26. For the full methodology and per-repo breakdowns, give our benchmark report a read: https://www.runcanary.ai/blog/qa-bench-v0

You can check out the product demo here: https://youtu.be/NeD9g1do_BU

We'd love feedback from anyone working on code verification or thinking about how to measure this differently.
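To make the "read the diff, find affected workflows" step above concrete, here is a toy sketch of the idea as a reverse dependency lookup. The workflow-to-file mapping is entirely hypothetical and hand-written for illustration; Canary derives this from the codebase itself (routes, controllers, call graphs), not a static table like this.

```python
# Toy sketch: map changed files from a diff to the user workflows that
# exercise them. The mapping below is invented for illustration; a real
# system would derive it from routes, controllers, and call graphs.

WORKFLOW_FILES = {
    "checkout": {"routes/cart.py", "controllers/payment.py"},
    "auth":     {"routes/login.py", "middleware/session.py"},
    "billing":  {"controllers/payment.py", "controllers/invoice.py"},
}

def affected_workflows(changed_files: set[str]) -> set[str]:
    """Return every workflow whose files intersect the diff."""
    return {
        name for name, files in WORKFLOW_FILES.items()
        if files & changed_files
    }

# A change to the payment controller touches two workflows:
print(sorted(affected_workflows({"controllers/payment.py"})))
# ['billing', 'checkout']
```

The interesting part in practice is building that mapping automatically and keeping it current as the code changes, which is where reading the codebase (rather than just the diff) matters.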
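The invoicing regression described above boils down to a simple invariant: the amount due on an invoice should match the proposal total within a rounding tolerance. A minimal sketch of that check (our own illustration with made-up numbers, not Canary's implementation):

```python
# Hypothetical sketch of the invariant behind the invoicing regression:
# amount due should match the proposal total within a small tolerance.

def check_invoice_drift(proposal_total: float, amount_due: float,
                        tolerance: float = 0.01) -> float:
    """Return the drift between invoice and proposal; raise if it
    exceeds the tolerance, the way a regression test would fail."""
    drift = amount_due - proposal_total
    if abs(drift) > tolerance:
        raise AssertionError(
            f"amount due drifted from proposal by ${drift:,.2f}"
        )
    return drift

# A drift like the ~$1,600 one described above would fail the check
# (figures here are invented for illustration):
# check_invoice_drift(proposal_total=42_000.00, amount_due=43_600.00)
```

A scheduled regression suite running checks like this continuously is what catches the drift before release rather than after.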
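For readers wondering what the benchmark dimensions above mean operationally: coverage can be thought of as recall over the ground-truth set of affected workflows, and relevance as precision over the workflows a model identified. A minimal sketch of that framing (our own illustration, not QA-Bench's published scoring code):

```python
# Illustrative scoring sketch: coverage = fraction of truly affected
# workflows the model found (recall); relevance = fraction of
# identified workflows that were actually affected (precision).
# This mirrors the dimensions described above but is not the
# benchmark's actual methodology.

def coverage(affected: set[str], identified: set[str]) -> float:
    return len(affected & identified) / len(affected) if affected else 1.0

def relevance(affected: set[str], identified: set[str]) -> float:
    return len(affected & identified) / len(identified) if identified else 1.0

affected = {"checkout", "auth", "billing"}
identified = {"checkout", "billing", "profile-edit"}
print(coverage(affected, identified))   # 2 of 3 affected flows found
print(relevance(affected, identified))  # 2 of 3 identified flows relevant
```

Coverage is the dimension where missing a workflow is most costly, which is presumably why the gaps between agents show up there.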