HackerNews中文版

Hi HN，我一直对当今云端 LLM 推理的工作方式感到沮丧。每次 API 调用都从头开始：你重新发送整个提示词 + 对话历史，并且你为每个输入 token 付费，即使模型之前已经“见过”该上下文。这导致了两个大问题：性能和成本——不断地重新发送输入 token 是浪费。质量损失——因为状态每次都在新的 GPU 上重建，模型会丢失大量内部上下文，而不仅仅是你的文本。行业中提供的大多数“优化”实际上只是提示词缓存。这对于削减重复的输入成本很有用，但我们都看到了副作用：输出与提示词的细微变化不匹配，或者模型自信地“跳到”错误的缓存响应，因为它认为你的查询是近乎重复的。我们正在用 ark-labs.cloud 采取不同的方法：真正的有状态推理——当你开始一个会话时，所有请求都在同一组 GPU 上处理，并且模型完整的内部状态（提示词、历史记录、推理轨迹）在调用之间被保留。零输入 token 成本——因为模型不需要你在每个请求上重新发送你的输入。你只需为生成的输出付费。更好的响应，而不仅仅是更便宜的响应——维护内部状态可以提高一致性和推理质量，而不仅仅是省钱。从开发者的角度来看，这很简单：启用 cookie，API 将保持会话活动（ark_session_id）。没有 SDK 魔术，没有黑客手段。会话在不活动后会过期以释放资源，但在它们活动期间，你正在与一个真正内部记忆的模型对话，而不仅仅是通过提示词的字符串连接。文档 <a href="https://ark-labs.cloud/documentation" rel="nofollow">https://ark-labs.cloud/documentation</a> 我们很乐意听取你的想法——特别是那些一直在与“为什么我为已经发送的 token 支付 10 倍的费用”问题作斗争的人，或者那些遇到提示词与输出不匹配的缓存系统的人。这种方法对你来说有意义吗？

查看原文

Hi HN,I’ve been frustrated for a while with how LLM inference works in the cloud today. Every API call starts from scratch: you resend your entire prompt + conversation history, and you’re charged for every input token, even if the model has already “seen” that context before.This leads to two big problems:Performance & cost – constantly resending input tokens is wasteful.Quality loss – because the state is rebuilt on a new GPU each time, the model loses a lot of internal context beyond just your text.Most “optimizations” offered in the industry are really just prompt-caching. That’s useful for cutting repeated input costs, but we’ve all seen the side-effects: outputs that don’t match subtle variations in the prompt, or the model confidently “jumping” to the wrong cached response because it thought your query was a near-duplicate.We’re taking a different approach with ark-labs.cloud:True stateful inference – when you start a session, all requests are processed on the same set of GPUs, and the full internal state of the model (prompt, history, reasoning traces) is preserved between calls.Zero input token cost – because the model doesn’t need you to resend your input on each request. You pay only for generated output.Better responses, not just cheaper ones – maintaining the internal state can improve consistency and reasoning quality, not just save money.From a developer perspective, it’s simple: enable cookies, and the API will keep a session alive (ark_session_id). No SDK magic, no hacks. Sessions do expire after inactivity to free resources, but while they’re active, you’re talking to a model that actually remembers internally, not just through string concatenation of prompts.Docs <a href="https://ark-labs.cloud/documentation/" rel="nofollow">https://ark-labs.cloud/documentation/</a>We’d love your thoughts — especially from those who’ve wrestled with the “why am I paying 10x for tokens I already sent” problem, or who’ve hit caching systems that mismatched prompts to outputs. Does this approach make sense to you?

Show HN: 有状态 LLM 推理 (输入 token 零成本，非 prompt 缓存)