Security breaks during partial failures: design notes from distributed systems

3 points by sandhyavinjam 6 months ago
TL;DR: Many security mechanisms fail not during attacks, but during partial outages. This post documents early design notes for a failure-aware security framework for distributed systems.

The problem

In production distributed systems, security often breaks when things are half working:

- auth services degrade → retries explode
- fallback paths widen access
- recovery logic becomes the attack surface

Nothing is “exploited”, yet the system becomes unsafe.

Most security models assume stable components and clean failures. Real systems don’t behave that way.

Design assumptions

We assume:

- correlated failures
- retries are adversarial
- timeouts are unsafe defaults
- recovery paths matter as much as steady-state logic

We don’t assume:

- global consistency
- perfect identity
- reliable clocks
- centralized enforcement

Framework ideas (high level)

This work explores four ideas:

1. Failure-aware trust

- Trust degrades under failure, not just compromise
- Access narrows automatically during partial outages

2. Security invariants at runtime

- Invariants are continuously enforced
- Violations trigger containment, not alerts

3. Retry-safe security primitives

- Idempotent, monotonic, side-effect bounded
- Retries can’t escalate privilege

4. Security as observable state

- Trust level, degradation, and containment are visible
- If you can’t observe it, you can’t secure it

What this is not

- Not zero trust marketing
- Not compliance
- Not a finished system

It’s an attempt to treat failure as the normal case, not an exception.

Why publish this early?

Because many real failures:

- don’t fit clean research papers
- happen during incidents, not attacks
- are invisible outside production systems

We’re sharing design notes to get feedback before formalizing or evaluating further.

Feedback welcome

If you’ve seen security regressions during outages or retries causing unsafe behavior, I’d like to hear about it.

This is ongoing work. No claims of novelty or completeness.
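
To make the four framework ideas above a bit more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the framework these notes describe: the names `TrustLevel`, `HealthSignals`, `effective_trust`, `REQUIRED_TRUST`, and `Session` are all hypothetical, and the 0.2 error-rate threshold is arbitrary. It shows trust narrowing when the auth service degrades, an invariant violation forcing containment rather than just raising an alert, a retry that can keep or lower a session's trust but never raise it, and the current trust level exposed as readable state.

```python
from dataclasses import dataclass
from enum import IntEnum


class TrustLevel(IntEnum):
    """Hypothetical trust levels; a higher value means broader access."""
    CONTAINED = 0  # no actions are allowed in this sketch
    DEGRADED = 1   # read-only
    NORMAL = 2     # reads and writes


@dataclass(frozen=True)
class HealthSignals:
    """Assumed inputs; a real system would derive these from its own telemetry."""
    auth_error_rate: float    # fraction of failed auth-service calls, 0.0 to 1.0
    invariant_violated: bool  # True if a runtime security invariant failed


def effective_trust(signals: HealthSignals) -> TrustLevel:
    """Ideas 1 and 2: trust narrows under failure; violations trigger containment."""
    if signals.invariant_violated:
        return TrustLevel.CONTAINED
    if signals.auth_error_rate > 0.2:  # arbitrary threshold, for illustration only
        return TrustLevel.DEGRADED
    return TrustLevel.NORMAL


# Minimum trust each action needs. Actions not listed here are always denied.
REQUIRED_TRUST = {"read": TrustLevel.DEGRADED, "write": TrustLevel.NORMAL}


class Session:
    """Idea 3: a retry can keep or lower this session's trust, never raise it."""

    def __init__(self, granted: TrustLevel) -> None:
        self.trust = granted

    def authorize(self, action: str, signals: HealthSignals) -> bool:
        # Monotonic update: the minimum of what the session had and what the
        # system currently allows, so repeating a call cannot escalate privilege.
        self.trust = min(self.trust, effective_trust(signals))
        required = REQUIRED_TRUST.get(action)
        # Unknown actions fail closed.
        return required is not None and self.trust >= required

    def state(self) -> dict:
        """Idea 4: expose the trust level so degradation is observable."""
        return {"trust": self.trust.name}


healthy = HealthSignals(auth_error_rate=0.01, invariant_violated=False)
degraded = HealthSignals(auth_error_rate=0.5, invariant_violated=False)

session = Session(TrustLevel.NORMAL)
first_write = session.authorize("write", healthy)    # allowed while healthy
outage_write = session.authorize("write", degraded)  # denied during the outage
outage_read = session.authorize("read", degraded)    # reads still allowed
retry_write = session.authorize("write", healthy)    # retry after recovery: still denied
```

One design choice worth calling out: in this sketch a session that was narrowed during an outage stays narrowed after the system recovers, so regaining write access would require a fresh session. Whether that is the right recovery behavior is exactly the kind of question the notes leave open.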