Designing a DSP architecture to handle 1M QPS of CPM ads without overspending
1 point • by charzlie • 14 days ago
I'm working on the system architecture for a high-throughput AdTech DSP and would love feedback from people who've built large-scale bidding / serving systems.

Constraints / Goals

- DSP only (no exchange)
- Target: 1M ad requests/sec
- End-to-end DSP latency budget: ~100ms
- Pricing model: CPM
- Hard requirement: no advertiser or campaign overspend

Targeting / Campaign Fetch

- I modeled targeting (geo, interests, etc.) using Redis + Roaring Bitmaps
- Fetching candidate campaigns alone:
  - Redis: ~1000 RPS at ~8ms (local machine, not cloud)
  - Aerospike: ~200–400 RPS at ~10ms
- This is only campaign fetching, not bidding or scoring

Budget / Wallet Model

- Advertiser has a wallet
- Campaign has:
  - Total budget
  - Daily budget
  - Daily spend tracking
- Overspend is not acceptable (even a small % matters at scale)

Budget Control Approaches Considered

- Splitting daily budgets into hourly buckets
- Rate limiting via:
  - Token bucket
  - PID controllers
- These reduce overspend but don't guarantee correctness under bursty traffic
- Recently considering micros (integer currency units) to reduce rounding errors

Open Questions

- At 1M QPS, how do people actually enforce budget guarantees?
  - Soft overspend with reconciliation?
  - Hard atomic checks in the hot path?
- Is Redis bitmap-based targeting viable at this scale, or does everyone eventually:
  - Pre-materialize campaign sets?
  - Push logic into memory / C++?
- How do you balance:
  - Strict budget enforcement
  - Low latency
  - High throughput
  without introducing global locks or cross-region contention?
- Is "no overspend ever" a realistic requirement, or is bounded error the industry norm?

I'm less interested in textbook answers and more in what has actually worked (or failed) in production.
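To make the bitmap-targeting part concrete, here's a toy sketch of the candidate fetch. Plain Python integers stand in for the Roaring Bitmaps / Redis structures; the segment names are made up for illustration:

```python
# Each targeting segment (geo, interest, ...) is a bitset over
# campaign IDs; a request's eligible candidates are the bitwise AND
# of the segments it matches. A Python int stands in for a Roaring
# Bitmap here.

def bitset(campaign_ids):
    """Build a bitset with one bit per campaign ID."""
    b = 0
    for cid in campaign_ids:
        b |= 1 << cid
    return b

def candidates(*segment_bitsets):
    """Intersect segment bitsets and decode the matching campaign IDs."""
    acc = segment_bitsets[0]
    for b in segment_bitsets[1:]:
        acc &= b
    out, i = [], 0
    while acc:
        if acc & 1:
            out.append(i)
        acc >>= 1
        i += 1
    return out

geo_us = bitset([1, 2, 5, 9])      # campaigns targeting US geo
interest_auto = bitset([2, 3, 9])  # campaigns targeting "auto" interest

print(candidates(geo_us, interest_auto))  # campaigns matching both
```

In the real setup the AND would run server-side (e.g. Redis BITOP or Roaring's intersection), so only the final candidate set crosses the wire.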
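For reference, the token-bucket pacing I mentioned looks roughly like this. It's a simplified single-process sketch (real deployment would be per campaign shard), and it shows exactly the weakness I described: it smooths spend but gives no hard guarantee under bursts:

```python
import time

class TokenBucket:
    """Minimal token-bucket pacer: tokens are budget micros released
    at a steady rate; a bid is allowed only if the bucket can cover
    its cost. Reduces burst overspend but is not a hard guarantee."""

    def __init__(self, rate_micros_per_sec: float, capacity_micros: int):
        self.rate = rate_micros_per_sec
        self.capacity = capacity_micros
        self.tokens = float(capacity_micros)  # start full
        self.last = time.monotonic()

    def try_spend(self, cost_micros: int) -> bool:
        # Refill based on elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost_micros:
            self.tokens -= cost_micros
            return True
        return False

bucket = TokenBucket(rate_micros_per_sec=1000, capacity_micros=5000)
print(bucket.try_spend(5000))  # drains the bucket
```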
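By "micros" I mean something like the following: keep all money as integer micro-units (1 currency unit = 1,000,000 micros) so per-impression CPM charges never accumulate float rounding error. The function names are just illustrative:

```python
MICROS_PER_UNIT = 1_000_000  # 1 currency unit = 1,000,000 micros

def cpm_to_micros_per_impression(cpm_units: float) -> int:
    """Convert a CPM price (cost per 1000 impressions, in currency
    units) to an integer cost per single impression, in micros.
    Floor division rounds down, which errs on the safe side for
    overspend."""
    # e.g. $2.50 CPM -> 2,500,000 micros per 1000 imps -> 2,500 micros/imp
    return round(cpm_units * MICROS_PER_UNIT) // 1000

def charge(budget_micros: int, cpm_units: float, impressions: int) -> int:
    """Debit a budget for N impressions; all arithmetic stays integer."""
    return budget_micros - cpm_to_micros_per_impression(cpm_units) * impressions

# $10 budget, $2.50 CPM, 1000 impressions served
remaining = charge(10 * MICROS_PER_UNIT, 2.50, 1000)
print(remaining)  # 7_500_000 micros = $7.50 left
```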
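And the "hard atomic check in the hot path" option I'm weighing is essentially check-and-reserve before bidding, release on loss. Sketched here with a local lock purely to show the invariant; in production this would be a CAS/script in a store like Redis or Aerospike, or a single-writer shard per campaign:

```python
import threading

class BudgetLedger:
    """Sketch of atomic check-and-reserve: reserve the bid cost before
    bidding, settle (or release) after the auction. The invariant is
    remaining - reserved >= 0 at all times, so overspend is impossible
    at the cost of a serialized check per bid."""

    def __init__(self, budget_micros: int):
        self._lock = threading.Lock()
        self.remaining = budget_micros  # unspent budget
        self.reserved = 0               # held by in-flight bids

    def try_reserve(self, cost_micros: int) -> bool:
        with self._lock:
            if self.remaining - self.reserved >= cost_micros:
                self.reserved += cost_micros
                return True
            return False  # would risk overspend; do not bid

    def settle(self, cost_micros: int, won: bool) -> None:
        with self._lock:
            self.reserved -= cost_micros
            if won:
                self.remaining -= cost_micros  # loss releases the hold

ledger = BudgetLedger(1000)
print(ledger.try_reserve(600))  # True: 600 held, 400 still free
```

The trade-off I'm asking about is exactly this serialization point: per-campaign it's cheap, but a shared wallet across campaigns turns it into the contention hotspot.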