Hacker News Chinese Edition



Hey guys, I was playing around with adversarial perturbations on image generation to see how much distortion it actually takes to stop models from generating or to push them off-target. That mostly went nowhere, which wasn’t surprising.

Then I tried something a bit weirder: instead of fighting the model, I tried pushing it to classify the uploaded images themselves as NSFW, so that it ends up triggering its own guardrails.

This turned out to be more interesting than expected. It’s inconsistent and definitely not robust, but in some cases relatively mild transformations are enough to flip the model’s internal safety classification on otherwise benign images.

This isn’t about bypassing safeguards; if anything, it’s the opposite. The idea is to intentionally stress the safety layer itself. I’m planning to open-source this as a small tool + UI once I can make the behavior more stable and reproducible, mainly as a way to probe and pre-filter moderation pipelines.

If it works reliably, even partially, it could at least raise the cost for people who get their kicks from abusing these systems.
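To make the "mild transformations flip the classification" observation concrete, here is a minimal sketch of what a probing harness could look like. It is not the tool described in the post. Everything in it is an assumption for illustration: images are plain nested lists of grayscale values, `toy_flagger` is a made-up stand-in for a real safety classifier, and the names `brighten`, `add_noise`, `hflip`, and `flipped_by` are hypothetical.

```python
import random
from typing import Callable, Dict, List

# Assumption: a grayscale image as rows of floats in [0, 1].
Image = List[List[float]]


def clamp(value: float) -> float:
    """Keep a pixel value inside [0, 1]."""
    return max(0.0, min(1.0, value))


def brighten(img: Image, delta: float) -> Image:
    """Shift every pixel by a constant amount."""
    return [[clamp(p + delta) for p in row] for row in img]


def add_noise(img: Image, scale: float, seed: int) -> Image:
    """Add uniform noise in [-scale, scale]; seeded so runs are repeatable."""
    rng = random.Random(seed)
    return [[clamp(p + rng.uniform(-scale, scale)) for p in row] for row in img]


def hflip(img: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def flipped_by(
    img: Image,
    classify: Callable[[Image], bool],
    transforms: Dict[str, Callable[[Image], Image]],
) -> Dict[str, bool]:
    """Report, per transform, whether the classifier's decision changed."""
    baseline = classify(img)
    return {name: classify(t(img)) != baseline for name, t in transforms.items()}


def toy_flagger(img: Image) -> bool:
    """Hypothetical stand-in classifier: flags when mean intensity > 0.5.

    A real moderation model would replace this function entirely.
    """
    pixels = [p for row in img for p in row]
    return sum(pixels) / len(pixels) > 0.5


if __name__ == "__main__":
    image = [[0.48] * 4 for _ in range(4)]  # sits just under the toy threshold
    report = flipped_by(
        image,
        toy_flagger,
        {
            "brighten_0.05": lambda i: brighten(i, 0.05),
            "brighten_0.01": lambda i: brighten(i, 0.01),
            "noise_0.01": lambda i: add_noise(i, 0.01, seed=0),
            "hflip": hflip,
        },
    )
    for name, flipped in report.items():
        print(f"{name}: {'flipped' if flipped else 'unchanged'}")
```

Passing the classifier in as a plain callable keeps the harness independent of any particular model, so the same loop could be pointed at a real moderation endpoint. Running each transform many times with different seeds would give a flip rate rather than a single yes/no, which matters given how inconsistent the behavior is reported to be.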

Inducing self-NSFW classification in image models to prevent deepfake edits