vLLM-Omni: Fully Disaggregated Serving for Any-to-Any Multimodal Models
Peiqi Yin, Jiangyun Zhu, Han Gao, Chenguang Zheng, Yongxiang Huang, Taichang Zhou, Ruirui Yang, Weizhi Liu, Weiqing Chen, Canlin Guo, Didan Deng, Zifeng Mo, Cong Wang, James Cheng, Roger Wang, Hongsheng Liu

TL;DR
vLLM-Omni introduces a fully disaggregated serving system that efficiently handles complex any-to-any multimodal models by decomposing architectures into interconnected stages, significantly improving performance and resource utilization.
Contribution
The paper proposes a novel stage abstraction and a disaggregated backend that together enable flexible, efficient serving of complex multimodal models built from multiple interconnected components.
Findings
Reduces job completion time by up to 91.4%.
Supports flexible GPU allocation and per-stage batching.
Enables decomposition of complex architectures into interconnected stages.
Abstract
Any-to-any multimodal models that jointly handle text, images, video, and audio represent a significant advance in multimodal AI. However, their complex architectures (typically combining multiple autoregressive LLMs, diffusion transformers, and other specialized components) pose substantial challenges for efficient model serving. Existing serving systems are mainly tailored to a single paradigm, such as autoregressive LLMs for text generation or diffusion transformers for visual generation. They lack support for any-to-any pipelines that involve multiple interconnected model components. As a result, developers must manually handle cross-stage interactions, leading to huge performance degradation. We present vLLM-Omni, a fully disaggregated serving system for any-to-any models. vLLM-Omni features a novel stage abstraction that enables users to decompose complex any-to-any architectures…
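The stage idea described above can be illustrated with a small sketch. This is not vLLM-Omni's API: `Stage`, `run_pipeline`, `max_batch_size`, `gpus`, and the stage names are hypothetical, chosen only to show how a model split into connected stages could batch requests independently at each stage and carry its own device assignment.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Stage:
    """One component of a decomposed model (hypothetical structure)."""
    name: str
    fn: Callable[[List[str]], List[str]]  # processes one batch of items
    max_batch_size: int                   # per-stage batching limit
    gpus: List[int]                       # illustrative device assignment, unused here
    downstream: List[str] = field(default_factory=list)


def run_pipeline(stages: Dict[str, Stage], entry: str,
                 requests: List[str]) -> Dict[str, List[str]]:
    """Push requests through connected stages, batching separately at each one."""
    outputs: Dict[str, List[str]] = {}
    pending = [(entry, requests)]
    while pending:
        name, items = pending.pop(0)
        stage = stages[name]
        results: List[str] = []
        # Each stage splits its input using its own batch size.
        for start in range(0, len(items), stage.max_batch_size):
            results.extend(stage.fn(items[start:start + stage.max_batch_size]))
        outputs[name] = results
        # Forward this stage's results to every connected stage.
        for nxt in stage.downstream:
            pending.append((nxt, results))
    return outputs


batch_log = []  # records (stage tag, batch size) for inspection


def make_fn(tag: str) -> Callable[[List[str]], List[str]]:
    """Build a stand-in for a real model component."""
    def fn(batch: List[str]) -> List[str]:
        batch_log.append((tag, len(batch)))
        return [f"{tag}({item})" for item in batch]
    return fn


# Assumed two-stage layout: an autoregressive stage feeding a diffusion stage.
stages = {
    "understand": Stage("understand", make_fn("ar"), max_batch_size=4,
                        gpus=[0], downstream=["generate"]),
    "generate": Stage("generate", make_fn("dit"), max_batch_size=2,
                      gpus=[1, 2]),
}

outputs = run_pipeline(stages, "understand", ["r0", "r1", "r2", "r3", "r4"])
```

In this toy run the first stage handles five requests in batches of at most four, while the second stage re-batches the same five items in groups of at most two. A real system would run the stages concurrently on separate devices rather than one after another as this sketch does.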
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Generative Adversarial Networks and Image Synthesis
