Show HN: I deliberately nerfed our coding agents
10 points • by noahfradin • 26 days ago
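Tl;dr: I trained a classifier that routes each request to the least expensive model and reasoning depth that can complete it. Coupling that with additional automated token-efficiency techniques has yielded 3x usage for the same spend. For anyone interested in trying it themselves: https://nerfguard.com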
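Several teammates and I recently switched from Claude Code to Codex. We still bounce between the two, but Codex's speed and steerability, coupled with its performance gains, were hard to ignore. One downside was that per-token pricing kicked in much sooner. This is happening across the board, but we felt it more acutely with Codex. We're a startup full of people who work around the clock and are obsessed with building, so naturally our **daily** bill alone was striking.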
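Luckily, we're going after a big mission, and speed matters significantly more than marginal token spend at the edges. Still, it got us thinking about how ludicrous the situation was: our own product has the side effect of decreasing token spend and speeding up agentic workflows by many orders of magnitude, yet we were using top-tier models for every kind of internal coding task without any of those optimizations. The waste felt pretty ridiculous. The most glaring culprit was that we were seemingly using the maximum-intelligence model at maximum reasoning depth for every task, even when the task clearly didn't require it. As a company that spends a lot of time on cached intelligence, we could also easily see plenty of other low-hanging fruit.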
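So, on a recent weekend, I quickly built a tool to optimize our usage. At its core is a **very fast** classifier that assigns each request to the lowest intelligence level required for the task, with some nice token optimizations layered on top. The result is roughly the same quality at several times lower token spend. Even more exciting for us, properly bin-packing intelligence and reasoning levels meant our speed also went up considerably. This wasn't negligible.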
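To make the routing idea concrete, here is a minimal sketch of "pick the cheapest tier that is likely sufficient." It is not Nerfguard's implementation: the tier names, reasoning labels, and keyword hints are all hypothetical, and a simple keyword heuristic stands in for the trained classifier described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str      # hypothetical tier name, not a real model ID
    reasoning: str  # hypothetical reasoning-depth label


# Ordered from cheapest to most expensive (assumed tiers).
ROUTES = [
    Route("small", "low"),
    Route("medium", "medium"),
    Route("large", "high"),
]

# Toy keyword signals standing in for a trained classifier.
HARD_HINTS = ("architecture", "race condition", "refactor", "design", "concurrency")
MEDIUM_HINTS = ("bug", "test", "implement", "function", "error")


def classify(request: str) -> int:
    """Return the index of the cheapest route judged sufficient for the request."""
    text = request.lower()
    if any(hint in text for hint in HARD_HINTS) or len(text) > 2000:
        return 2
    if any(hint in text for hint in MEDIUM_HINTS):
        return 1
    return 0


def route(request: str) -> Route:
    """Map a request to a (model, reasoning) pair via the classifier."""
    return ROUTES[classify(request)]


if __name__ == "__main__":
    for prompt in (
        "Rename this variable to user_id",
        "Fix the bug in the login test",
        "Debug a race condition in the scheduler",
    ):
        print(prompt, "->", route(prompt))
```

The design choice worth noting is that the classifier only has to be cheap and fast relative to the models it routes to. A misroute to a tier that is too small costs a retry, so a real system would presumably need some way to escalate.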
We've observed up to 3x savings, plus hours per person per day that we would otherwise have spent waiting on tool turns and coding agent responses.
For us, that means improved engineering velocity and significantly higher usage for the same spend. It also means more usage before getting throttled.
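For intuition on how routing turns into "more usage for the same spend," here is a back-of-the-envelope sketch. Every number in it is a made-up assumption chosen for illustration, not a measurement from this post: it supposes three tiers with hypothetical relative costs and a hypothetical share of requests landing in each.

```python
def blended_cost(mix: dict[str, float], relative_cost: dict[str, float]) -> float:
    """Average cost per request, relative to sending everything to the top tier."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("request shares must sum to 1")
    return sum(share * relative_cost[tier] for tier, share in mix.items())


# Hypothetical cost of each tier relative to the top tier (top tier = 1.0).
relative_cost = {"small": 0.1, "medium": 0.4, "large": 1.0}

# Hypothetical share of requests routed to each tier.
mix = {"small": 0.5, "medium": 0.35, "large": 0.15}

cost = blended_cost(mix, relative_cost)  # 0.05 + 0.14 + 0.15
usage_multiplier = 1 / cost              # requests affordable per unit of spend

print(f"blended cost: {cost:.2f}x of all-top-tier")
print(f"usage at the same spend: {usage_multiplier:.2f}x")
```

Under these assumed numbers the blended cost is about a third of the all-top-tier cost, so the same budget covers roughly three times as many requests. The actual multiple depends entirely on the real price gaps and how your own requests are distributed.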
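As I told friends about this, they also wanted to use it to maximize the usage they could get out of their coding agent plans. Engineers at many of the most cutting-edge AI companies now use this tool to optimize their token utilization in this way, not just to save money but to maximize output. It turns out that the best way to avoid getting nerfed by Claude is to intentionally nerf yourself, selectively. We decided to release it for the rest of the builder community as well. You can now turn on Nerfguard for yourself and start getting more usage today.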