I got tired of the chaos of LLM skills, so I built my own with regression tests
3 points • by iliaov • 28 days ago
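I recently spent a week with skills like Garry Tan's GStack and found some flaws (I'll post about those separately).
Here's my problem: how do I know whether a skill or prompt is any good (e.g. GStack's `/office-hours`)?
How do I compare similar skills (e.g. different "deep research" skills)?
Spotting broken software is (relatively) easy: it crashes or prints errors. Broken skills don't. Polished, confident-sounding skills routinely mislead me and waste my time, to the point where I wish I weren't using an LLM at all.
AI skills are software, and they should come with regression tests.
LLM teams have tons of prompt regression tests. So do LLM-wrapper SaaS companies. But open-source skills ship a SKILL.md that reads reasonably and zero tests (e.g. GStack's `/office-hours` has none at the time of writing).
Garry Tan, if you're reading this: please consider shipping regression tests for your `/office-hours`, `/plan-ceo-review`, `/plan-eng-review`, and other skills.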
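Regression tests should:
1. Prove the skill works correctly
2. Demonstrate correct and incorrect usage
3. Prove the skill's value
4. Come with a scoring rubric so the skill can be benchmarked

The last one is the most valuable, because it lets you benchmark similar skills against each other.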
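To make the rubric idea concrete, here is a minimal Python sketch of scoring skill outputs against a weighted rubric and ranking two skills by that score. Everything in it is an assumption for illustration: the `Criterion` class, the `score` and `benchmark` helpers, the rubric entries, and the sample outputs are all hypothetical and are not taken from GStack or from `plan-cmo-review`. The keyword checks stand in for whatever judge you would really use (a human reviewer or an LLM grader).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    """One rubric line: a name, a weight, and a pass/fail check on the output."""
    name: str
    weight: int
    check: Callable[[str], bool]


def score(output: str, rubric: list[Criterion]) -> float:
    """Return the weighted fraction of rubric criteria the output satisfies (0.0 to 1.0)."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if c.check(output))
    return earned / total


def benchmark(outputs: dict[str, str], rubric: list[Criterion]) -> list[tuple[str, float]]:
    """Score each skill's output with the same rubric and rank best-first."""
    scored = [(name, score(text, rubric)) for name, text in outputs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical rubric for a marketing-review skill. Keyword matching is a
# placeholder for a real judge, not a recommendation.
RUBRIC = [
    Criterion("names a target customer", 2, lambda o: "target customer" in o.lower()),
    Criterion("questions the user's assumptions", 3, lambda o: "assumption" in o.lower()),
    Criterion("proposes a distribution channel", 1, lambda o: "channel" in o.lower()),
]

# Hypothetical outputs from two competing skills run on the same fixture prompt.
OUTPUTS = {
    "skill_a": "Target customer: solo founders. Riskiest assumption: they will pay. Channel: newsletters.",
    "skill_b": "Great plan, ship it.",
}

if __name__ == "__main__":
    for name, value in benchmark(OUTPUTS, RUBRIC):
        print(f"{name}: {value:.2f}")
```

Because both skills are scored by the same rubric on the same fixture, the numbers are comparable, which is the point of item 4.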
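So I started doing this myself.
Here's a work-in-progress example: `plan-cmo-review`, a skill that complements GStack, which has no marketing review at the time of writing. I'm not a marketing person; the point of sharing this skill is to outline its regression setup.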
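Briefly, here's how my exploration went:
* I used GStack on a couple of products and realized the resulting `design_document.md` was leading me toward failure, mainly on the marketing side.
* I dug into the skill's failures manually with Claude Opus 4.8's help and eventually found the correct solution.
* I asked Claude to build a `plan-cmo-review` skill and ran it. It arrived at a flawed solution, similar to GStack's output.
* I gave Claude the correct (manual) solution to analyze and add as a regression fixture with a scoring rubric.
* Claude ran the regression blind, and it failed. We iterated several times and found the key problem: Claude was trusting my prompts implicitly as the ultimate truth. Claude believed GStack knew what it was doing. GStack believed I knew what I was doing. But I was doing product/startup research, and by definition "research" is what you do when you don't know what you're doing. That trust chain is what broke the skills.
* We fixed the trust problem and the regression test passed. We added a few more. They passed too.
* I had Claude run the regressions multiple times, and cracks appeared. Claude iterated on the skill. Now they all pass.
* This methodology is still flawed. I'd like to try running different LLMs, cross-model judging, and many more regression tests.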
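The "run it multiple times and cracks appear" step can be sketched as a pass-rate check rather than a single pass/fail. This is a rough Python sketch under stated assumptions: `pass_rate`, `make_stub_skill`, and `challenges_assumptions` are hypothetical names, and the stub skill simulates a nondeterministic LLM with a seeded random number generator instead of calling a real model.

```python
import random
from typing import Callable


def pass_rate(
    run_skill: Callable[[str], str],
    passes: Callable[[str], bool],
    prompt: str,
    runs: int = 10,
) -> float:
    """Run the skill `runs` times on the same prompt and return the fraction of passing runs."""
    results = [passes(run_skill(prompt)) for _ in range(runs)]
    return sum(results) / runs


def make_stub_skill(seed: int, failure_probability: float) -> Callable[[str], str]:
    """Build a stand-in for a real skill (hypothetical): it sometimes accepts the
    user's plan uncritically, which is the trust failure described above."""
    rng = random.Random(seed)

    def run(prompt: str) -> str:
        if rng.random() < failure_probability:
            return "Looks great, proceed as planned."
        return "Before proceeding, verify this assumption: customers will pay."

    return run


def challenges_assumptions(output: str) -> bool:
    """Placeholder check: did the output question the plan instead of endorsing it?"""
    return "assumption" in output.lower()


if __name__ == "__main__":
    skill = make_stub_skill(seed=0, failure_probability=0.2)
    rate = pass_rate(skill, challenges_assumptions, "Review my launch plan.", runs=20)
    print(f"pass rate over 20 runs: {rate:.2f}")
```

A single green run says little about a nondeterministic skill; gating on a pass rate (for example, requiring every one of N runs to pass) is one way to surface the cracks earlier. Swapping `run_skill` for calls to different models would be a starting point for the cross-model comparison mentioned in the last bullet.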
Skill: github.com/remakeai/plan-cmo-review. Notes: iliaov.substack.com.