Hacker News (Chinese edition)

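Hi HN,

I built a system that runs 35B-parameter language models on older Pascal GPUs (P100 + GTX 1080 Ti) using multi-GPU memory spillover.

Problem: Most LLM inference tools (Ollama, LM Studio) are limited to the VRAM of a single GPU (roughly 13B models at most on a 16GB GPU). If you have several older GPUs, the second one sits idle.

Solution: Multi-GPU + CPU memory spillover with QLoRA 4-bit quantization. The system automatically distributes layers across GPU0 → GPU1 → CPU RAM, which makes 35B models usable on hardware that normally tops out at 13B.

Benchmarks (P100 16GB + GTX 1080 Ti 11GB):
- Qwen-14B: 13.7 tokens/sec (9.4GB VRAM)
- OPT-30B: 5.4 tokens/sec (15.2GB VRAM)
- CodeLlama-34B: 0.8 tokens/sec (16.7GB VRAM)

Quick start:

```bash
# Pull the prebuilt image
docker pull rickeshtn/large-model-international_release:latest

# Start the interactive chat; the current directory is mounted as /workspace
# and doubles as the Hugging Face model cache
docker run -it --rm --runtime=nvidia --gpus all --ipc=host \
  --ulimit memlock=-1 --ulimit stack=268435456 \
  -v "$(pwd)":/workspace -e HF_HOME=/workspace/model_cache \
  rickeshtn/large-model-international_release:latest \
  python /app/interactive_chat.py --model-name Qwen/Qwen2.5-14B-Instruct
```

Technical details:
- QLoRA 4-bit NF4 quantization (75% memory reduction, i.e. 4-bit weights instead of 16-bit)
- HuggingFace Transformers + Accelerate + bitsandbytes
- Automatic device mapping with CPU offload
- Interactive chat with conversation persistence

GitHub: https://github.com/rickeshtn/locallm-pascal
Docker Hub: https://hub.docker.com/r/rickeshtn/large-model-international_release

34 users are already running it. Happy to answer technical questions!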
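To make the GPU0 → GPU1 → CPU RAM spillover and the 75% figure more concrete, here is a minimal standalone sketch in plain Python. It is not the project's code and not Accelerate's actual placement algorithm. It only illustrates the idea of filling devices in order and moving to the next one when a layer no longer fits. The names `weight_bytes`, `Device`, and `place_layers`, the memory budgets, the 48-layer count, and the equal-size layers are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass

GIB = 1024 ** 3


def weight_bytes(n_params: float, bits_per_param: float) -> float:
    """Approximate storage for the weights alone (no KV cache or activations)."""
    return n_params * bits_per_param / 8


@dataclass
class Device:
    name: str
    budget_bytes: float  # memory we allow the weights to occupy on this device


def place_layers(layer_bytes, devices):
    """Greedy spillover: fill devices in order, advance when a layer does not fit."""
    placement = {}
    dev_idx = 0
    used = 0.0
    for i, size in enumerate(layer_bytes):
        # Skip ahead until the current device has room for this layer.
        while dev_idx < len(devices) and used + size > devices[dev_idx].budget_bytes:
            dev_idx += 1
            used = 0.0
        if dev_idx == len(devices):
            raise MemoryError(f"layer {i} does not fit on any remaining device")
        placement[i] = devices[dev_idx].name
        used += size
    return placement


# 4-bit weights versus 16-bit weights for a 34B-parameter model.
fp16 = weight_bytes(34e9, 16)
nf4 = weight_bytes(34e9, 4)
print(f"16-bit: {fp16 / GIB:.1f} GiB, 4-bit: {nf4 / GIB:.1f} GiB, "
      f"saving: {1 - nf4 / fp16:.0%}")

# Hypothetical setup: 48 equal-size layers, budgets set below the physical
# VRAM to leave headroom for activations and the KV cache.
n_layers = 48
layers = [nf4 / n_layers] * n_layers
devices = [
    Device("gpu0", 12 * GIB),  # P100 16GB, assumed budget
    Device("gpu1", 8 * GIB),   # GTX 1080 Ti 11GB, assumed budget
    Device("cpu", 32 * GIB),   # system RAM, assumed budget
]
print(Counter(place_layers(layers, devices).values()))
```

Layers that land on `cpu` in a scheme like this have to be computed on the CPU or copied to a GPU whenever they are needed, so throughput tends to drop sharply once a model no longer fits in combined VRAM.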



Running 35B-parameter LLMs on dual Pascal GPUs with QLoRA