Show HN: How we run 60 Hugging Face models on 2 GPUs

Author: pveldandi, 7 days ago
Most open-source LLM deployments assume one model per GPU. That works if traffic is steady. In practice, many workloads are long-tail or intermittent, which means GPUs sit idle most of the time.

We experimented with a different approach.

Instead of pinning one model to one GPU, we:

- Stage model weights on fast local disk
- Load models into GPU memory only when requested
- Keep a small working set resident
- Evict inactive models aggressively
- Route everything through a single OpenAI-compatible endpoint

(A rough sketch of this load/evict pattern and a client-side example are at the end of this post.)

In our recent test setup (2×A6000, 48GB each), we made ~60 Hugging Face text models available for activation. Only a few are resident in VRAM at any given time; the rest are restored when needed.

Cold starts still exist. Larger models take seconds to restore. But by avoiding warm pools and dedicated GPUs per model, overall utilization improves significantly for light workloads.

Short demo here: https://m.youtube.com/watch?v=IL7mBoRLHZk

Live demo to play with: https://inferx.net:8443/demo/

If anyone here is running multi-model inference and wants to benchmark this approach with their own models, I'm happy to provide temporary access for testing.
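
For anyone curious what the load-on-demand/evict loop in the list above looks like in code, here is a minimal sketch of the general pattern. It is not the actual implementation behind the demo; the `MAX_RESIDENT` value, the `WEIGHTS_DIR` path, and the plain LRU policy are illustrative assumptions.

```python
# Minimal sketch of the load-on-demand / evict pattern described above.
# Not the real implementation -- assumes weights are already staged on
# local disk and that each model fits on a single GPU.
from collections import OrderedDict

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MAX_RESIDENT = 3               # small working set kept in VRAM (assumed value)
WEIGHTS_DIR = "/nvme/models"   # hypothetical staging directory on fast local disk

# model_id -> (model, tokenizer), ordered from least to most recently used
_resident: "OrderedDict[str, tuple]" = OrderedDict()


def get_model(model_id: str):
    """Return a GPU-resident (model, tokenizer) pair, loading and evicting as needed."""
    if model_id in _resident:
        _resident.move_to_end(model_id)      # mark as most recently used
        return _resident[model_id]

    # Evict least-recently-used models until there is room for one more.
    while len(_resident) >= MAX_RESIDENT:
        evicted_id, (evicted_model, _) = _resident.popitem(last=False)
        del evicted_model                    # drop the Python reference
        torch.cuda.empty_cache()             # return the freed VRAM to the allocator
        print(f"evicted {evicted_id}")

    # Cold start: restore weights from local disk into GPU memory.
    path = f"{WEIGHTS_DIR}/{model_id}"
    tokenizer = AutoTokenizer.from_pretrained(path)
    model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float16).to("cuda:0")
    model.eval()

    _resident[model_id] = (model, tokenizer)
    return model, tokenizer
```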
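
And a hypothetical client-side view of the single OpenAI-compatible endpoint: the `model` field selects which model to activate, and the first request to a cold model pays the restore latency. The `base_url`, `api_key`, and model names below are placeholders, not the real demo values.

```python
# Hypothetical client usage against an OpenAI-compatible router endpoint.
from openai import OpenAI

client = OpenAI(base_url="https://inferx.example/v1", api_key="not-needed")

for model_id in ["Qwen/Qwen2.5-7B-Instruct", "mistralai/Mistral-7B-Instruct-v0.3"]:
    resp = client.chat.completions.create(
        model=model_id,  # a cold model is restored from disk before this returns
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
    )
    print(model_id, "->", resp.choices[0].message.content)
```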