Show HN: How we run 60 Hugging Face models on 2 GPUs

Author: pveldandi, 7 days ago
Most open-source LLM deployments assume one model per GPU. That works if traffic is steady. In practice, many workloads are long-tail or intermittent, which means GPUs sit idle most of the time.

We experimented with a different approach.

Instead of pinning one model to one GPU, we:

- Stage model weights on fast local disk
- Load models into GPU memory only when requested
- Keep a small working set resident
- Evict inactive models aggressively
- Route everything through a single OpenAI-compatible endpoint

(A rough sketch of this load/evict pattern and a client-side example are at the end of this post.)

In our recent test setup (2×A6000, 48GB each), we made ~60 Hugging Face text models available for activation. Only a few are resident in VRAM at any given time; the rest are restored when needed.

Cold starts still exist. Larger models take seconds to restore. But by avoiding warm pools and dedicated GPUs per model, overall utilization improves significantly for light workloads.

Short demo here: https://m.youtube.com/watch?v=IL7mBoRLHZk

Live demo to play with: https://inferx.net:8443/demo/

If anyone here is running multi-model inference and wants to benchmark this approach with their own models, I'm happy to provide temporary access for testing.
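
For anyone curious what the load-on-demand/evict loop in the list above looks like in code, here is a minimal sketch of the general pattern. It is not the actual implementation behind the demo; the `MAX_RESIDENT` value, the `WEIGHTS_DIR` path, and the plain LRU policy are illustrative assumptions.

```python
# Minimal sketch of the load-on-demand / evict pattern described above.
# Not the real implementation -- assumes weights are already staged on
# local disk and that each model fits on a single GPU.
from collections import OrderedDict

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MAX_RESIDENT = 3               # small working set kept in VRAM (assumed value)
WEIGHTS_DIR = "/nvme/models"   # hypothetical staging directory on fast local disk

# model_id -> (model, tokenizer), ordered from least to most recently used
_resident: "OrderedDict[str, tuple]" = OrderedDict()


def get_model(model_id: str):
    """Return a GPU-resident (model, tokenizer) pair, loading and evicting as needed."""
    if model_id in _resident:
        _resident.move_to_end(model_id)      # mark as most recently used
        return _resident[model_id]

    # Evict least-recently-used models until there is room for one more.
    while len(_resident) >= MAX_RESIDENT:
        evicted_id, (evicted_model, _) = _resident.popitem(last=False)
        del evicted_model                    # drop the Python reference
        torch.cuda.empty_cache()             # return the freed VRAM to the allocator
        print(f"evicted {evicted_id}")

    # Cold start: restore weights from local disk into GPU memory.
    path = f"{WEIGHTS_DIR}/{model_id}"
    tokenizer = AutoTokenizer.from_pretrained(path)
    model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float16).to("cuda:0")
    model.eval()

    _resident[model_id] = (model, tokenizer)
    return model, tokenizer
```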
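
And a hypothetical client-side view of the single OpenAI-compatible endpoint: the `model` field selects which model to activate, and the first request to a cold model pays the restore latency. The `base_url`, `api_key`, and model names below are placeholders, not the real demo values.

```python
# Hypothetical client usage against an OpenAI-compatible router endpoint.
from openai import OpenAI

client = OpenAI(base_url="https://inferx.example/v1", api_key="not-needed")

for model_id in ["Qwen/Qwen2.5-7B-Instruct", "mistralai/Mistral-7B-Instruct-v0.3"]:
    resp = client.chat.completions.create(
        model=model_id,  # a cold model is restored from disk before this returns
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
    )
    print(model_id, "->", resp.choices[0].message.content)
```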