Show HN: 有状态 LLM 推理 (输入 token 零成本,非 prompt 缓存)

1作者: arkonrad9 个月前
Hi HN, 我一直对当今云端 LLM 推理的工作方式感到沮丧。每次 API 调用都从头开始:你重新发送整个提示词 + 对话历史,并且你为每个输入 token 付费,即使模型之前已经“见过”该上下文。 这导致了两个大问题: 性能和成本——不断地重新发送输入 token 是浪费。 质量损失——因为状态每次都在新的 GPU 上重建,模型会丢失大量内部上下文,而不仅仅是你的文本。 行业中提供的大多数“优化”实际上只是提示词缓存。这对于削减重复的输入成本很有用,但我们都看到了副作用:输出与提示词的细微变化不匹配,或者模型自信地“跳到”错误的缓存响应,因为它认为你的查询是近乎重复的。 我们正在用 ark-labs.cloud 采取不同的方法: 真正的有状态推理——当你开始一个会话时,所有请求都在同一组 GPU 上处理,并且模型完整的内部状态(提示词、历史记录、推理轨迹)在调用之间被保留。 零输入 token 成本——因为模型不需要你在每个请求上重新发送你的输入。你只需为生成的输出付费。 更好的响应,而不仅仅是更便宜的响应——维护内部状态可以提高一致性和推理质量,而不仅仅是省钱。 从开发者的角度来看,这很简单:启用 cookie,API 将保持会话活动(ark_session_id)。没有 SDK 魔术,没有黑客手段。会话在不活动后会过期以释放资源,但在它们活动期间,你正在与一个真正内部记忆的模型对话,而不仅仅是通过提示词的字符串连接。 文档 <a href="https://ark-labs.cloud/documentation" rel="nofollow">https://ark-labs.cloud/documentation</a> 我们很乐意听取你的想法——特别是那些一直在与“为什么我为已经发送的 token 支付 10 倍的费用”问题作斗争的人,或者那些遇到提示词与输出不匹配的缓存系统的人。这种方法对你来说有意义吗?
查看原文
Hi HN,<p>I’ve been frustrated for a while with how LLM inference works in the cloud today. Every API call starts from scratch: you resend your entire prompt + conversation history, and you’re charged for every input token, even if the model has already “seen” that context before.<p>This leads to two big problems:<p>Performance &amp; cost – constantly resending input tokens is wasteful.<p>Quality loss – because the state is rebuilt on a new GPU each time, the model loses a lot of internal context beyond just your text.<p>Most “optimizations” offered in the industry are really just prompt-caching. That’s useful for cutting repeated input costs, but we’ve all seen the side-effects: outputs that don’t match subtle variations in the prompt, or the model confidently “jumping” to the wrong cached response because it thought your query was a near-duplicate.<p>We’re taking a different approach with ark-labs.cloud:<p>True stateful inference – when you start a session, all requests are processed on the same set of GPUs, and the full internal state of the model (prompt, history, reasoning traces) is preserved between calls.<p>Zero input token cost – because the model doesn’t need you to resend your input on each request. You pay only for generated output.<p>Better responses, not just cheaper ones – maintaining the internal state can improve consistency and reasoning quality, not just save money.<p>From a developer perspective, it’s simple: enable cookies, and the API will keep a session alive (ark_session_id). No SDK magic, no hacks. Sessions do expire after inactivity to free resources, but while they’re active, you’re talking to a model that actually remembers internally, not just through string concatenation of prompts.<p>Docs <a href="https:&#x2F;&#x2F;ark-labs.cloud&#x2F;documentation&#x2F;" rel="nofollow">https:&#x2F;&#x2F;ark-labs.cloud&#x2F;documentation&#x2F;</a><p>We’d love your thoughts — especially from those who’ve wrestled with the “why am I paying 10x for tokens I already sent” problem, or who’ve hit caching systems that mismatched prompts to outputs. Does this approach make sense to you?