Hacker News (Chinese edition)



TL;DR: Many security mechanisms fail not during attacks, but during partial outages. This post documents early design notes for a failure-aware security framework for distributed systems.

The problem

In production distributed systems, security often breaks when things are half working:

- auth services degrade → retries explode
- fallback paths widen access
- recovery logic becomes the attack surface

Nothing is “exploited”, yet the system becomes unsafe.

Most security models assume stable components and clean failures. Real systems don’t behave that way.

Design assumptions

We assume:

- correlated failures
- retries are adversarial
- timeouts are unsafe defaults
- recovery paths matter as much as steady-state logic

We don’t assume:

- global consistency
- perfect identity
- reliable clocks
- centralized enforcement

Framework ideas (high level)

This work explores four ideas:

1. Failure-aware trust
- Trust degrades under failure, not just compromise
- Access narrows automatically during partial outages

2. Security invariants at runtime
- Invariants are continuously enforced
- Violations trigger containment, not alerts

3. Retry-safe security primitives
- Idempotent, monotonic, side-effect bounded
- Retries can’t escalate privilege

4. Security as observable state
- Trust level, degradation, and containment are visible
- If you can’t observe it, you can’t secure it

What this is not

- Not zero trust marketing
- Not compliance
- Not a finished system

It’s an attempt to treat failure as the normal case, not an exception.

Why publish this early?

Because many real failures:

- don’t fit clean research papers
- happen during incidents, not attacks
- are invisible outside production systems

We’re sharing design notes to get feedback before formalizing or evaluating further.

Feedback welcome

If you’ve seen security regressions during outages or retries causing unsafe behavior, I’d like to hear about it.

This is ongoing work. No claims of novelty or completeness.
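To make the first and third ideas a bit more concrete, here is a minimal Python sketch of how "trust degrades under failure" and "retries can't escalate privilege" might fit together. It is an illustration only, not the framework's design: the names Trust, Session, REQUIRED_TRUST and call_with_retries, the two example actions, and the three-failure threshold are all assumptions made up for this example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Trust(IntEnum):
    """Hypothetical trust levels; a higher value means broader access."""
    CONTAINED = 0
    DEGRADED = 1
    FULL = 2


# Hypothetical policy: the minimum trust level each action requires.
REQUIRED_TRUST = {
    "read_profile": Trust.DEGRADED,
    "rotate_keys": Trust.FULL,
}


@dataclass
class Session:
    """Per-session state. `trust` is plain, observable data and, in this
    sketch, can only decrease (monotonic)."""
    trust: Trust = Trust.FULL
    failures: int = 0

    def record_dependency_failure(self) -> None:
        # Failure-aware trust: every failed dependency call narrows access.
        # The threshold of 3 is an arbitrary value chosen for illustration.
        self.failures += 1
        floor = Trust.CONTAINED if self.failures >= 3 else Trust.DEGRADED
        self.trust = min(self.trust, floor)

    def allowed(self, action: str) -> bool:
        # Unknown actions are denied, so a missing policy entry fails closed.
        required = REQUIRED_TRUST.get(action)
        return required is not None and self.trust >= required


def call_with_retries(session: Session, action: str, op, max_attempts: int = 3):
    """Run `op` with a bounded number of attempts.

    Authorization is re-checked before every attempt, and a timeout lowers
    trust, so a retry can lose privilege but never gain it.
    """
    for _ in range(max_attempts):
        if not session.allowed(action):
            raise PermissionError(f"{action} denied at trust={session.trust.name}")
        try:
            return op()
        except TimeoutError:
            session.record_dependency_failure()
    raise TimeoutError(f"{action} failed after {max_attempts} attempts")
```

In this sketch a single timeout during rotate_keys drops the session to DEGRADED, so the second attempt is refused with PermissionError rather than retried through a wider fallback, while read_profile stays available until the session reaches CONTAINED. How trust would be restored after recovery is deliberately left out, since the notes above do not describe it.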

Security breaks during partial failures: design notes from distributed systems