How I run an unlimited $6/month AI service on four RTX 3090s
4 points • by yolo-auto • 7 days ago
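This is the story of how I launched an unlimited LLM provider to about 60 hyped people on a waitlist, then immediately served them a fully dysfunctional, death-looping model. Most people, very reasonably, disappeared. A few extremely nice people stuck around anyway, so we kept the project alive. It's still pretty chaotic, but it's gaining traction.
To back up a little: I believe the whole point of AI agents is that they should keep working. They should read files, retry, search, code, summarize, run tools, and loop until the job is done. When your employer is paying, who cares about cost? But when it's my own money and my own hobby projects, every loop feels like a tiny financial event, and you start babysitting the agent instead of using it. That's no fun.
Metered pricing makes me worry about using too much, and usage subscriptions make me feel like I need to use every last percent of the quota or I'm "wasting it". If only an unlimited provider existed...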
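Then I joined the AMD developer program. I got some credits to spin up my own MI300x and started tinkering with vllm/sglang inference serving on AMD.
After learning about the MI300x, I did some napkin math:
Renting an MI300x at $2.00 an hour = ~$1500 a month. It can probably support about 150 users on a small MoE model like qwen-35b-3a, maybe more.
$1500 / 150 users = $10.00 per user per month, and we all get to play with agents for a small price.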
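Here is the same napkin math as a small Python sketch. The 730-hour month is my assumption (an average calendar month); the post only gives the rounded ~$1500 figure, so the script shows both the exact product and the rounded number the post works from.

```python
# Napkin math from the post: hourly GPU rent -> monthly cost -> price per user.
HOURS_PER_MONTH = 730  # assumption: 24 h * 365 days / 12 months


def monthly_cost(hourly_rate_usd: float, hours: float = HOURS_PER_MONTH) -> float:
    """Cost of renting one GPU around the clock for a month."""
    return hourly_rate_usd * hours


def price_per_user(monthly_cost_usd: float, users: int) -> float:
    """Flat monthly price at which `users` subscribers exactly cover the rent."""
    return monthly_cost_usd / users


exact = monthly_cost(2.00)  # slightly under the ~$1500 the post rounds to
print(f"monthly rent at $2.00/h: ${exact:.2f}")
print(f"per user at the rounded $1500 figure: ${price_per_user(1500, 150):.2f}")
```

Rounding the rent up to $1500 leaves a little headroom, which is probably the right direction to be wrong in for this kind of estimate.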
You can oversubscribe a bit, so I landed on $6 per user per month for 2 generation slots, 128k context, no token limits, and no rate limits.
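For anyone wondering what "2 generation slots" could look like in a router, here is a minimal sketch. It assumes a slot means one in-flight generation per user, which the post does not spell out, and `SlotLimiter` is a hypothetical name, not the actual yolo-auto router code. Extra requests wait for a free slot instead of being rejected, so it caps concurrency without acting as a rate limit.

```python
import asyncio

SLOTS_PER_USER = 2  # from the post: 2 generation slots per user


class SlotLimiter:
    """Caps concurrent generations per user; extra requests queue up."""

    def __init__(self, slots: int = SLOTS_PER_USER) -> None:
        self._slots = slots
        self._semaphores: dict[str, asyncio.Semaphore] = {}

    def _semaphore_for(self, user_id: str) -> asyncio.Semaphore:
        # Create each user's semaphore lazily on their first request.
        if user_id not in self._semaphores:
            self._semaphores[user_id] = asyncio.Semaphore(self._slots)
        return self._semaphores[user_id]

    async def run(self, user_id: str, generate):
        """Run the `generate` coroutine function once a slot is free."""
        async with self._semaphore_for(user_id):
            return await generate()


async def demo() -> int:
    """Fire 5 requests for one user and return the peak concurrency seen."""
    limiter = SlotLimiter()
    active = 0
    peak = 0

    async def fake_generation() -> None:
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stand-in for a real model call
        active -= 1

    await asyncio.gather(*(limiter.run("alice", fake_generation) for _ in range(5)))
    return peak


print("peak concurrent generations:", asyncio.run(demo()))
```

A semaphore per user keeps one heavy user from starving others at the router level, while the inference server still decides how to batch whatever gets through.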
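I built the site and the router, made a waitlist, and then over-optimized the MI300x to the point where vllm bench reported something like 3k+ output and 40k+ throughput. But I never tested the final config and serve commands, and that's how I ended up with a disaster launch. You couldn't prompt the thing without it looping or bugging out. It was cursed. That's where we lost a lot of people.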
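A pre-launch smoke test along these lines might catch a looping model before users do. This is only a sketch under my own assumptions: the n-gram repetition heuristic and its thresholds are illustrative, and `generate` is a hypothetical callable you would wire to your own endpoint. The fake backends below stand in for a real server.

```python
from typing import Callable, Iterable


def looks_like_loop(text: str, ngram: int = 8, max_repeats: int = 3) -> bool:
    """True if any run of `ngram` consecutive words occurs more than `max_repeats` times."""
    words = text.split()
    counts: dict[tuple[str, ...], int] = {}
    for i in range(len(words) - ngram + 1):
        key = tuple(words[i:i + ngram])
        counts[key] = counts.get(key, 0) + 1
        if counts[key] > max_repeats:
            return True
    return False


def smoke_test(generate: Callable[[str], str], prompts: Iterable[str]) -> list[str]:
    """Return the prompts whose output is empty or looks like a repetition loop."""
    failures = []
    for prompt in prompts:
        output = generate(prompt)
        if not output.strip() or looks_like_loop(output):
            failures.append(prompt)
    return failures


# Fake backends for illustration only.
def healthy(prompt: str) -> str:
    return "Paris is the capital of France."


def cursed(prompt: str) -> str:
    return "I will now read the file and then summarize it. " * 20


print("healthy failures:", smoke_test(healthy, ["What is the capital of France?"]))
print("cursed failures:", smoke_test(cursed, ["What is the capital of France?"]))
```

Running a handful of prompts like this against the exact serve command you plan to ship checks the thing a throughput benchmark does not: whether the output is usable.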
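Luckily, my buddy had a few 3090s, so he threw me a lifeline and began hosting qwen for us on 2x 3090s. We finally had a working model that wasn't costing $2.00 an hour for our whopping 3 users.
We started gaining more users, so we moved up to 4x 3090s. We still have plenty of room for more users, but even so, since then:
we've configured vllm wrong about 15 times
a GPU died
we lost power
I made a bunch of one-click starts for openclaw, hermes, and pi-mono. None of them really work right, which probably drives people away. They are still on our site right now.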
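...but people who know what they're doing seem to really like the price point. All in all, we have about 98% uptime. It's been about a month. We've both learned a ton: even with backgrounds in SWE/SE/AI, being on the hook for a couple of paying users forced us to really focus on delivering a good product. I think we might now be close to covering the power/hosting bill, so we'd no longer be operating at a loss (if you include the 3090 capex, we're still at a loss).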
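To put the uptime figure in concrete terms, here is a quick conversion from an uptime percentage to hours of downtime. The 30-day month is my assumption; the post only says "about a month".

```python
def downtime_hours(uptime_fraction: float, days: int = 30) -> float:
    """Hours of downtime implied by an uptime fraction over `days` days."""
    return days * 24 * (1.0 - uptime_fraction)


print(f"{downtime_hours(0.98):.1f} hours down over 30 days")
```

Under that assumption, 98% works out to roughly 14 hours of downtime in the month.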
We break even at the point where we can move to the cloud and max out an MI300x, which is now tuned and ready to go once we have the users.
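Using the figures from earlier in the post (about $1500 a month for the MI300x, $6 per user, and a rough capacity of 150 users), the break-even user count can be sketched as below. The oversubscription ratio is my own derivation from those numbers, not something the post states, and it ignores every cost other than GPU rent.

```python
import math

MONTHLY_RENT_USD = 1500   # the post's rounded MI300x rent
PRICE_PER_USER_USD = 6    # the post's subscription price
ESTIMATED_CAPACITY = 150  # the post's rough users-per-MI300x estimate


def break_even_users(monthly_cost: float, price: float) -> int:
    """Smallest number of subscribers whose fees cover the monthly cost."""
    return math.ceil(monthly_cost / price)


users = break_even_users(MONTHLY_RENT_USD, PRICE_PER_USER_USD)
ratio = users / ESTIMATED_CAPACITY
print(f"break-even subscribers: {users}")
print(f"implied oversubscription vs. the 150-user estimate: {ratio:.2f}x")
```

So the $6 price only works on the MI300x if noticeably more than 150 people subscribe and most of them are not generating at the same time.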
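I'm also finding that in some areas, subscribing to our service is cheaper than running the model yourself (but as someone who loves local models, I totally get it).
Since then, I've been working on a desktop agent that actually works with small models like qwen. It's going to replace the broken one-click starts. It's barebones, but it's something that just works out of the box. I made it open source, and you can see what I'm talking about here: https://github.com/yolo-auto-org/yolo-auto-desktop. We're at yolo-auto.com, and we have an abysmal free tier to prove it works!
Anyway, hope you got a laugh or found it interesting! Drop a question if you have any.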