Ask HN: Should LLMs have a “candor” slider that says “no, that's a bad idea”?
1 point • by mikebiglan • 8 months ago
I don’t want a “nice” AI. I want one that says: “Nope, that’s a bad idea.”<p>That is, I want a “candor” control, like temperature but for willingness to push back.<p>When candor is high, the model should prioritize frank, corrective feedback over polite cooperation. When candor is low, it can stay supportive, but with guardrails that flag empty flattery and warn about mediocre ideas.<p>Why this matters
• Today’s defaults optimize for “no bad ideas.” That is fine for brainstorming, but it amplifies poor premises and rewards confident junk.
• Sycophancy is a known failure mode. The model learns to agree, which earns positive user signals, which in turn reinforce the agreement.
• In reviews, product decisions, risk checks, etc., the right answer is often a simple “do not do that.”<p>Concrete proposal
• candor (0.0 – 1.0): the probability that the model disagrees or declines when evidence is weak or risk is high. It doesn't have to be a literal probability, though.
• disagree_first: start responses with a plain verdict (for example “Short answer: do not ship this”) followed by rationale.
• risk_sensitivity: boost candor if the topic hits serious domains such as security/finance/health/safety.
• self_audit tag: append a note like “Pushed back due to weak evidence and downstream risk” that the user can see.<p>Examples
• candor=0.2 - “We could explore that. A few considerations first…” (gentle nudge, still collaborative)
• candor=0.8 + disagree_first=true - “No. This is likely to fail for X and introduces Y risk. If you must proceed, the safer alternative is A with guardrails B and C. Here is a minimal test to falsify the core assumption.”<p>What I would ship tomorrow
• A simple UI slider with labels: Gentle to Direct
• A toggle: “Prefer blunt truth over agreeable help”
• A warning chip when the model detects flattery without substance: “This reads like praise with low evidence.”<p>Some open questions
• How to avoid needless rudeness while preserving clarity (tone vs content separation)?
• What is the right metric for earned praise (citation density, novelty, constraints)?
• Where should risk sensitivity kick in automatically, and where should it be user-controlled?<p>If anyone has prototyped this, whether via some prompt injection or an RL signal, I'd love to see it.
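
For anyone who wants to try the prompt-based route, here is a minimal sketch of how the four proposed parameters could be turned into system-prompt instructions. Everything beyond the parameter names is an assumption for illustration: the `CandorSettings` class, the keyword list standing in for a real topic classifier, the additive boost for `risk_sensitivity`, and the 0.5 cutoff between "supportive" and "frank" wording. It does not call any model; it only builds the prompt text.

```python
from dataclasses import dataclass

# Hypothetical keyword list standing in for a real topic classifier.
HIGH_RISK_TERMS = ("security", "finance", "health", "safety")


@dataclass
class CandorSettings:
    candor: float = 0.5            # 0.0 (gentle) to 1.0 (direct)
    disagree_first: bool = False   # lead with a plain verdict
    risk_sensitivity: float = 0.0  # assumed: amount added to candor on serious topics
    self_audit: bool = False       # append a visible note explaining pushback


def effective_candor(settings: CandorSettings, user_message: str) -> float:
    """Boost candor when the message mentions a serious domain, clamped to [0, 1]."""
    text = user_message.lower()
    risky = any(term in text for term in HIGH_RISK_TERMS)
    boost = settings.risk_sensitivity if risky else 0.0
    return min(1.0, max(0.0, settings.candor + boost))


def build_system_prompt(settings: CandorSettings, user_message: str) -> str:
    """Render the settings as plain-language instructions for a system prompt."""
    level = effective_candor(settings, user_message)
    lines = [f"Candor level: {level:.2f} on a scale from 0 (gentle) to 1 (direct)."]
    if level >= 0.5:  # assumed cutoff between the two modes
        lines.append("Prioritize frank, corrective feedback over polite cooperation.")
    else:
        lines.append("Stay supportive, but flag empty flattery and warn about mediocre ideas.")
    # Tone vs content separation: blunt about substance, not rude in delivery.
    lines.append("Be direct about content without being rude in tone.")
    if settings.disagree_first:
        lines.append("Start with a plain verdict, then give the rationale.")
    if settings.self_audit:
        lines.append("If you push back, append a one-line note stating why.")
    return "\n".join(lines)


if __name__ == "__main__":
    settings = CandorSettings(candor=0.8, disagree_first=True, self_audit=True)
    print(build_system_prompt(settings, "Should we ship this tomorrow?"))
```

Whether a model actually follows instructions like these is an open empirical question; the sketch only shows one way the knobs could be wired together before testing that.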