Cost-Efficient Multimodal LLM Inference via Cross-Tier GPU Heterogeneity
Donglin Yu

TL;DR
This paper introduces a cost-effective approach for multimodal large language model inference by leveraging cross-tier GPU heterogeneity, optimizing partitioning to reduce data transfer and improve efficiency on commodity hardware.
Contribution
It identifies the modality boundary as the optimal partition point, develops a phase-aware runtime, and demonstrates significant cost and performance improvements with heterogeneous GPU deployment.
Findings
Partitioning at the modality boundary minimizes cross-device transfer.
Heterogeneous deployment achieves up to 40.6% cost savings.
HeteroServe improves throughput and token efficiency on real models.
Abstract
Multimodal large language model (MLLM) inference splits into two phases with opposing hardware demands: vision encoding is compute-bound, while language generation is memory-bandwidth-bound. We show that under standard transformer KV caching, the modality boundary (between vision encoder and language model) minimizes cross-device transfer among all partition points that preserve standard stage-based execution. Partitioning here reduces the transferred data from GB-scale KV caches (under stage-level disaggregation) to MB-scale embeddings, an O(L) reduction where L is the transformer depth. The result holds across attention mechanisms (MHA/GQA), dynamic vision resolutions, and model scales, and the advantage grows as models deepen. A direct implication is that existing stage-level disaggregation systems are constrained to high-bandwidth…
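The O(L) gap between KV-cache transfer and embedding transfer can be illustrated with a rough back-of-the-envelope sketch. The KV-cache size formula (two tensors per layer, per KV head, per token) is the standard one; the model dimensions below are hypothetical values chosen for illustration and are not taken from the paper, and the function names are likewise made up for this example.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to ship a full KV cache: K and V for every layer and token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def embedding_bytes(num_tokens, hidden_dim, bytes_per_elem=2):
    """Bytes needed to ship vision embeddings across the modality boundary."""
    return num_tokens * hidden_dim * bytes_per_elem


# Hypothetical configuration (assumption, not from the paper):
# 32 layers, hidden size 4096, 32 heads of dim 128, 576 vision tokens, fp16.
NUM_LAYERS = 32
HIDDEN_DIM = 4096
HEAD_DIM = 128
VISION_TOKENS = 576

emb = embedding_bytes(VISION_TOKENS, HIDDEN_DIM)

# MHA: every attention head has its own K/V, so num_kv_heads * head_dim == hidden_dim.
kv_mha = kv_cache_bytes(NUM_LAYERS, 32, HEAD_DIM, VISION_TOKENS)

# GQA: fewer KV heads (8 here) shrink the cache, but it still scales with depth.
kv_gqa = kv_cache_bytes(NUM_LAYERS, 8, HEAD_DIM, VISION_TOKENS)

print(f"embeddings: {emb / 2**20:.1f} MiB")
print(f"KV cache (MHA): {kv_mha / 2**20:.1f} MiB, ratio {kv_mha // emb}x")
print(f"KV cache (GQA): {kv_gqa / 2**20:.1f} MiB, ratio {kv_gqa // emb}x")
```

Under these assumptions the MHA ratio is exactly 2L, and GQA lowers the constant factor without removing the dependence on depth, which is consistent with the abstract's statement that the advantage grows as models deepen. Longer contexts push the KV cache toward the GB scale the abstract mentions, while the embedding transfer depends only on the number of vision tokens.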
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Embedded Systems Design Techniques
