ModServe: Modality- and Stage-Aware Resource Disaggregation for Scalable Multimodal Model Serving
Haoran Qiu, Anish Biswas, Zihan Zhao, Jayashree Mohan, Alind Khare, Esha Choukse, Íñigo Goiri, Zeyu Zhang, Haiying Shen, Chetan Bansal, Ramachandran Ramjee, Rodrigo Fonseca

TL;DR
ModServe is a modular, adaptive serving system for large multimodal models. It decouples inference stages so each can be scheduled and autoscaled independently, which yields 3.3-5.5x higher throughput and 25-41.3% lower cost.
Contribution
This paper introduces ModServe, the first system to enable modality- and stage-aware resource disaggregation for scalable multimodal model serving.
Findings
ModServe achieves 3.3-5.5x higher throughput.
It reduces costs by 25-41.3%.
It effectively handles bursty traffic and tail latency requirements.
Abstract
Large multimodal models (LMMs) demonstrate impressive capabilities in understanding images, videos, and audio beyond text. However, efficiently serving LMMs in production environments poses significant challenges due to their complex architectures and heterogeneous characteristics across their multi-stage inference pipelines. We present the first comprehensive systems analysis of two prominent LMM architectures, decoder-only and cross-attention, across six representative open-source models, revealing key systems design implications. We also present an in-depth analysis of production LMM inference traces, uncovering unique workload characteristics, including variable, heavy-tailed request distributions and bursty traffic patterns. Based on these insights, we propose ModServe, a modular LMM serving system that decouples stages for independent optimization and adaptive scaling. ModServe…
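To make the idea of decoupling stages for independent scaling concrete, here is a minimal sketch. It is not ModServe's algorithm: the stage names, per-replica capacities, headroom factor, and the functions `replicas_needed` and `plan` are all hypothetical and chosen only for illustration. It shows how each stage can be sized from its own load instead of scaling the whole pipeline as one unit.

```python
import math
from dataclasses import dataclass


@dataclass
class Stage:
    """One independently scaled pipeline stage (hypothetical model)."""
    name: str
    per_replica_rps: float  # assumed capacity of one replica, requests/second


def replicas_needed(stage: Stage, arrival_rps: float, headroom: float = 0.25) -> int:
    """Smallest replica count whose capacity covers the load plus headroom.

    The headroom factor is an illustrative stand-in for burst tolerance.
    """
    if arrival_rps <= 0:
        return 1  # keep one warm replica even when the stage is idle
    return max(1, math.ceil(arrival_rps * (1 + headroom) / stage.per_replica_rps))


def plan(stages: list[Stage], arrival_rps_by_stage: dict[str, float]) -> dict[str, int]:
    """Size every stage from its own arrival rate, independently of the others."""
    return {s.name: replicas_needed(s, arrival_rps_by_stage[s.name]) for s in stages}


# Hypothetical two-stage pipeline: image encoding feeds text generation.
stages = [Stage("image_encode", per_replica_rps=4.0),
          Stage("text_generate", per_replica_rps=10.0)]

# Only some requests carry images, so the two stages see different loads.
allocation = plan(stages, {"image_encode": 20.0, "text_generate": 25.0})
print(allocation)
```

Because each stage is sized separately, a burst of image-heavy requests grows only the encoder pool in this sketch, while the text-generation pool stays at the size its own load requires.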
Taxonomy
Topics: Advanced Computational Techniques and Applications · Power Systems and Technologies · Speech and dialogue systems
