Show HN: ElasticMM – 4.2× Faster Multimodal LLM Serving (NeurIPS 2025 Oral)

Author: PaperWeekly · 5 days ago
ElasticMM is a newly released open-source serving system designed for modern multimodal large language models (MLLMs). The work was selected as an Oral presentation at NeurIPS 2025.

Unlike existing serving stacks such as vLLM, which are primarily optimized for text-only workloads, ElasticMM introduces Elastic Multimodal Parallelism (EMP), a new execution paradigm that adapts parallelism across different inference stages and modalities.

Key findings from the paper:

* Up to 4.2× reduction in TTFT (time to first token)
* 3.2×–4.5× higher throughput under mixed multimodal workloads
* Modality-aware scheduling, elastic stage partitioning, unified prefix caching, and non-blocking encoding (a rough sketch of the first and last ideas follows below)
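To make the scheduling idea concrete, here is a minimal, hypothetical sketch of modality-aware dispatch combined with non-blocking encoding: image encoding runs in its own worker so text-only requests never queue behind vision work. All names (`schedule`, `vision_worker`, etc.) are invented for illustration; this is not ElasticMM's actual API or implementation.

```python
import asyncio
import time
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    prompt: str
    image: bytes | None = None
    embeddings: list | None = None

def encode_image(image: bytes) -> list:
    """Stand-in for a ViT-style image encoder (simulated latency)."""
    time.sleep(0.1)
    return [0.0] * 8

async def vision_worker(encode_q: asyncio.Queue, prefill_q: asyncio.Queue):
    """Encode images off the critical path, then hand off to prefill."""
    while True:
        req = await encode_q.get()
        # Run the (blocking) encoder in a thread so the loop stays free.
        req.embeddings = await asyncio.to_thread(encode_image, req.image)
        await prefill_q.put(req)
        encode_q.task_done()

async def prefill_worker(prefill_q: asyncio.Queue):
    """Stand-in for LLM prefill; serves both modalities from one queue."""
    while True:
        req = await prefill_q.get()
        kind = "multimodal" if req.embeddings else "text-only"
        print(f"prefill request {req.rid} ({kind})")
        prefill_q.task_done()

async def schedule(req: Request, encode_q: asyncio.Queue, prefill_q: asyncio.Queue):
    """Modality-aware dispatch: only image requests take the encode detour."""
    if req.image is None:
        await prefill_q.put(req)
    else:
        await encode_q.put(req)

async def main():
    encode_q: asyncio.Queue = asyncio.Queue()
    prefill_q: asyncio.Queue = asyncio.Queue()
    workers = [
        asyncio.create_task(vision_worker(encode_q, prefill_q)),
        asyncio.create_task(prefill_worker(prefill_q)),
    ]
    reqs = [
        Request(0, "describe this", image=b"\x00"),
        Request(1, "hello"),  # text-only: reaches prefill immediately
        Request(2, "caption", image=b"\x00"),
    ]
    for r in reqs:
        await schedule(r, encode_q, prefill_q)
    await encode_q.join()
    await prefill_q.join()
    for w in workers:
        w.cancel()

asyncio.run(main())
```

The real system presumably applies this decoupling at the level of GPU parallelism rather than a single event loop, given that EMP is described as adapting parallelism across stages and modalities.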
Paper (OpenReview PDF): https://openreview.net/pdf?id=Zd6VyjmN1S

GitHub repo: https://github.com/hpdps-group/ElasticMM

Curious to hear what the HN community thinks, especially from those building LLM/MLLM inference stacks or dealing with multimodal serving in production.