Why Your RAG Costs $2,400/Month (and How We Cut It by 73%)

2 points | by helain | 18 days ago
You're running RAG in production. Then the AWS bill lands: $2,400/month for 50 queries/day. That works out to roughly $1.60 per query.

We built a RAG system for enterprise clients and realized most production RAG deployments are optimization disasters. The literature obsesses over accuracy while completely ignoring unit economics.

The Three Cost Buckets

Vector Database (40-50% of the bill)
Standard RAG pipelines make 3-5 unnecessary database queries per question. We were doing 5 round-trips for what should have been 1.5.

LLM API (30-40%)
Standard RAG pumps 8-15k tokens into the LLM. That's 5-10x more than necessary. What we found: beyond about 3,000 tokens of context, accuracy plateaus. Everything past that point is noise and cost.

Infrastructure (15-25%)
Vector databases sitting idle, monitoring overhead, unnecessary load balancing.

What Actually Moved the Needle

Token-Aware Context (35% savings)
Budget-based assembly that stops once you've used enough tokens. Before: 12k tokens/query. After: 3.2k tokens. Same accuracy.

```python
def _build_context(self, results, settings):
    # Append retrieved chunks until the token budget is spent, then stop.
    max_tokens = settings.get("max_context_tokens", 2000)
    current_tokens = 0
    context_parts = []
    for result in results:
        tokens = self.llm.count_tokens(result)
        if current_tokens + tokens > max_tokens:
            break
        context_parts.append(result)
        current_tokens += tokens
    return "\n\n".join(context_parts)
```

Hybrid Reranking (25% savings)
70% semantic + 30% keyword scoring. Better ranking means fewer chunks are needed: top-20 retrieval becomes top-8 while maintaining quality. (A scoring sketch follows at the end of the post.)

Embedding Caching (20% savings)
Workspace-isolated cache with a 7-day TTL. We see a 45-60% intra-day hit rate. (The matching read path is sketched at the end as well.)

```python
import hashlib, json

async def set_embedding(self, text, embedding, workspace_id=None):
    # hashlib gives a key that is stable across processes; built-in hash() is not.
    key = f"embedding:ws_{workspace_id}:{hashlib.sha256(text.encode()).hexdigest()}"
    await self.redis.setex(key, 604800, json.dumps(embedding))  # 604800 s = 7-day TTL
```

Batch Embedding (15% savings)
Batch API pricing is 30-40% cheaper per token. Process 50 texts per request instead of one at a time. (See the batching sketch at the end.)
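The hybrid reranking step isn't shown in code in the post; here is a minimal sketch of the 70/30 blend, assuming each candidate arrives as a (text, cosine_score) pair from the vector store. The keyword signal is a simple term-overlap ratio, and `rerank` and `keyword_score` are illustrative names, not the author's API.

```python
import re

def keyword_score(query: str, chunk: str) -> float:
    # Fraction of query terms that literally appear in the chunk.
    query_terms = set(re.findall(r"\w+", query.lower()))
    chunk_terms = set(re.findall(r"\w+", chunk.lower()))
    return len(query_terms & chunk_terms) / len(query_terms) if query_terms else 0.0

def rerank(query, candidates, semantic_weight=0.7, keyword_weight=0.3, top_k=8):
    # candidates: list of (chunk_text, semantic_similarity) pairs, e.g. the top-20 from the vector DB.
    scored = [
        (semantic_weight * sim + keyword_weight * keyword_score(query, text), text)
        for text, sim in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]  # top-20 in, top-8 out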
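The post only shows the cache write; a plausible read path is sketched below, checking Redis before paying for an embedding call. `get_or_embed` and `self.embed_api` are hypothetical names, and `self.redis` is assumed to be an async Redis client (e.g. redis.asyncio).

```python
import hashlib
import json

async def get_or_embed(self, text, workspace_id=None):
    key = f"embedding:ws_{workspace_id}:{hashlib.sha256(text.encode()).hexdigest()}"
    cached = await self.redis.get(key)
    if cached is not None:
        return json.loads(cached)                    # the 45-60% intra-day hit case
    embedding = await self.embed_api(text)           # miss: one paid embedding call
    await self.redis.setex(key, 604800, json.dumps(embedding))
    return embedding
```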
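For the batching step, the post names a batch size of 50 but no provider; this sketch just groups texts and hands each group to whatever embedding call accepts a list of inputs. `embed_batch` is a placeholder for that provider call.

```python
from typing import Callable, List

def embed_in_batches(texts: List[str],
                     embed_batch: Callable[[List[str]], List[List[float]]],
                     batch_size: int = 50) -> List[List[float]]:
    # One request per group of up to `batch_size` texts instead of one request per text.
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        embeddings.extend(embed_batch(texts[start:start + batch_size]))
    return embeddings
```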