Show HN: How we run 60 Hugging Face models on 2 GPUs
2 points • by pveldandi • 7 days ago
Most open-source LLM deployments assume one model per GPU. That works if traffic is steady. In practice, many workloads are long-tail or intermittent, which means GPUs sit idle most of the time.
We experimented with a different approach.
Instead of pinning one model to one GPU, we (rough sketch after the list):
* Stage model weights on fast local disk
* Load models into GPU memory only when requested
* Keep a small working set resident
* Evict inactive models aggressively
* Route everything through a single OpenAI-compatible endpoint
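To make the load-on-demand / evict-when-inactive loop concrete, here is a minimal sketch using Hugging Face transformers on a single GPU. The `ModelCache` class, its method names, and the residency limit are illustrative assumptions, not the actual InferX implementation (which also has to handle batching, concurrency, and safer eviction):

```python
# Minimal sketch of lazy loading + LRU eviction for many models on one GPU.
# Illustrative only; not the InferX codebase.
from collections import OrderedDict

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class ModelCache:
    """Keep at most `max_resident` models in VRAM; evict least-recently-used."""

    def __init__(self, max_resident: int = 3, device: str = "cuda:0"):
        self.max_resident = max_resident
        self.device = device
        # model_id -> (model, tokenizer), ordered by recency of use
        self.resident: "OrderedDict[str, tuple]" = OrderedDict()

    def get(self, model_id: str):
        if model_id in self.resident:
            self.resident.move_to_end(model_id)   # mark as most recently used
            return self.resident[model_id]

        # Evict the least-recently-used model(s) before loading a new one.
        while len(self.resident) >= self.max_resident:
            _, (model, _) = self.resident.popitem(last=False)
            del model                             # drop the only reference...
            torch.cuda.empty_cache()              # ...and return cached VRAM

        # Cold path: restore weights from local disk (the HF cache) into VRAM.
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(
            model_id, torch_dtype=torch.float16
        ).to(self.device)
        self.resident[model_id] = (model, tokenizer)
        return model, tokenizer


def generate(cache: ModelCache, model_id: str, prompt: str) -> str:
    model, tokenizer = cache.get(model_id)
    inputs = tokenizer(prompt, return_tensors="pt").to(cache.device)
    output_ids = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```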
In our recent test setup (2×A6000, 48GB each), we made ~60 Hugging Face text models available for activation. Only a few are resident in VRAM at any given time; the rest are restored when needed.
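The mix of model sizes isn't specified here, but a quick back-of-envelope calculation (assuming 7B-class models in fp16, purely for illustration) shows why only a handful can be resident at once:

```python
# Rough VRAM budget per A6000 (illustrative assumptions, not measured numbers).
params = 7e9                                 # a 7B-parameter model
bytes_per_param = 2                          # fp16 weights
weights_gb = params * bytes_per_param / 1e9  # ~14 GB of weights
vram_gb = 48                                 # one A6000
print(int(vram_gb // weights_gb))            # ~3 resident models, before KV cache
```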
Cold starts still exist. Larger models take seconds to restore. But by avoiding warm pools and dedicated GPUs per model, overall utilization improves significantly for light workloads.
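From the client's point of view, all of this is invisible apart from cold-start latency: you point a standard OpenAI SDK client at the single endpoint and switch models by changing the `model` field. The base URL, API key, and model name below are placeholders, not the real service details:

```python
# Hypothetical client usage against a single OpenAI-compatible endpoint;
# base_url, api_key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://your-inferx-host/v1", api_key="your-key")

# Switching models is just a different `model` string; the server loads it on demand.
resp = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```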
Short demo: [https://m.youtube.com/watch?v=IL7mBoRLHZk](https://m.youtube.com/watch?v=IL7mBoRLHZk)
Live demo to play with: [https://inferx.net:8443/demo/](https://inferx.net:8443/demo/)
If anyone here is running multi-model inference and wants to benchmark this approach with their own models, I'm happy to provide temporary access for testing.