vLLM-Omni: Fully Disaggregated Serving for Any-to-Any Multimodal Models

Peiqi Yin; Jiangyun Zhu; Han Gao; Chenguang Zheng; Yongxiang Huang; Taichang Zhou; Ruirui Yang; Weizhi Liu; Weiqing Chen; Canlin Guo; Didan Deng; Zifeng Mo; Cong Wang; James Cheng; Roger Wang; Hongsheng Liu

arXiv:2602.02204 · cs.DC · February 3, 2026

Open Access

TL;DR

vLLM-Omni introduces a fully disaggregated serving system that efficiently handles complex any-to-any multimodal models by decomposing architectures into interconnected stages, significantly improving performance and resource utilization.

Contribution

It proposes a novel stage abstraction and disaggregated backend enabling flexible, efficient serving of complex multimodal models with multiple interconnected components.

Findings

01. Reduces job completion time by up to 91.4%.

02. Supports flexible GPU allocation and per-stage batching.

03. Enables decomposition of complex architectures into interconnected stages.
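
To make the idea of stages with per-stage batching more concrete, here is a minimal single-process sketch in Python. It is not the vLLM-Omni API: the `Stage` class, the `serve` and `make_stage` functions, the stage names, and the string-tagging "models" are all hypothetical stand-ins. It also assumes a simple linear chain of stages, whereas the paper describes interconnected stages more generally, and it only records a GPU count rather than placing anything on real devices.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List


@dataclass
class Stage:
    """One model component in the pipeline (hypothetical, for illustration)."""
    name: str
    run_batch: Callable[[List[str]], List[str]]  # processes one whole batch
    max_batch_size: int                          # per-stage batching limit
    num_gpus: int = 1                            # per-stage allocation (recorded only)
    queue: Deque[str] = field(default_factory=deque)


def serve(stages: List[Stage], requests: List[str]) -> List[str]:
    """Push requests through a linear chain of stages, batching per stage."""
    stages[0].queue.extend(requests)
    finished: List[str] = []
    while any(stage.queue for stage in stages):
        for i, stage in enumerate(stages):
            if not stage.queue:
                continue
            # Each stage forms its own batch, independent of its neighbours.
            n = min(stage.max_batch_size, len(stage.queue))
            batch = [stage.queue.popleft() for _ in range(n)]
            outputs = stage.run_batch(batch)
            if i + 1 < len(stages):
                stages[i + 1].queue.extend(outputs)  # hand off to next stage
            else:
                finished.extend(outputs)
    return finished


def make_stage(name: str, tag: str, max_batch_size: int, num_gpus: int = 1) -> Stage:
    """Build a toy stage whose 'model' just appends a tag to each item."""
    return Stage(name, lambda batch: [f"{x}->{tag}" for x in batch],
                 max_batch_size, num_gpus)


# Hypothetical three-stage pipeline with different batch sizes per stage.
pipeline = [
    make_stage("ar_llm", "tokens", max_batch_size=4, num_gpus=2),
    make_stage("diffusion", "latents", max_batch_size=2),
    make_stage("decoder", "audio", max_batch_size=1),
]
results = serve(pipeline, ["req0", "req1", "req2"])
for r in results:
    print(r)
```

Each stage drains its own queue at its own batch size, so a slow stage with a small batch limit does not dictate how the earlier stages group their work. A real disaggregated system would run stages in separate processes or on separate devices and move tensors between them; that part is deliberately left out here.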

Abstract

Any-to-any multimodal models that jointly handle text, images, video, and audio represent a significant advance in multimodal AI. However, their complex architectures (typically combining multiple autoregressive LLMs, diffusion transformers, and other specialized components) pose substantial challenges for efficient model serving. Existing serving systems are mainly tailored to a single paradigm, such as autoregressive LLMs for text generation or diffusion transformers for visual generation. They lack support for any-to-any pipelines that involve multiple interconnected model components. As a result, developers must manually handle cross-stage interactions, leading to huge performance degradation. We present vLLM-Omni, a fully disaggregated serving system for any-to-any models. vLLM-Omni features a novel stage abstraction that enables users to decompose complex any-to-any architectures…

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Generative Adversarial Networks and Image Synthesis