Cost-Efficient Multimodal LLM Inference via Cross-Tier GPU Heterogeneity

Donglin Yu

arXiv:2603.12707 · cs.LG · March 16, 2026

TL;DR

This paper presents a cost-efficient approach to multimodal large language model (MLLM) inference that exploits cross-tier GPU heterogeneity. It partitions the model at the point that minimizes cross-device data transfer, which improves efficiency on commodity hardware.

Contribution

The paper shows that the modality boundary is the partition point that minimizes cross-device transfer, develops a phase-aware runtime, and reports cost savings of up to 40.6% along with performance improvements under heterogeneous GPU deployment.

Findings

01. Partitioning at the modality boundary minimizes cross-device transfer.

02. Heterogeneous deployment achieves up to 40.6% cost savings.

03. HeteroServe improves throughput and token efficiency on real models.

Abstract

Multimodal large language model (MLLM) inference splits into two phases with opposing hardware demands: vision encoding is compute-bound, while language generation is memory-bandwidth-bound. We show that under standard transformer KV caching, the modality boundary (between vision encoder and language model) minimizes cross-device transfer among all partition points that preserve standard stage-based execution. Partitioning here reduces transfer complexity from $O(L \cdot s_{ctx})$ bytes (GB-scale KV caches under stage-level disaggregation) to $O(N_{v} \cdot d)$ bytes (MB-scale embeddings), an $O(L)$ reduction where $L$ is the transformer depth. The result holds across attention mechanisms (MHA/GQA), dynamic vision resolutions, and model scales, and the advantage grows as models deepen. A direct implication is that existing stage-level disaggregation systems are constrained to high-bandwidth…
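To make the GB-scale versus MB-scale contrast concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the paper: the function names, the FP16 element size, and the model dimensions (32 layers, 4096-token context, 32 KV heads of dimension 128, 1024 visual tokens, hidden size 4096) are all illustrative assumptions, chosen only to show how the two transfer volumes scale.

```python
# Illustrative comparison of cross-device transfer volume at two partition points.
# All names and dimensions below are assumptions for this sketch, not values
# reported in the paper.

def kv_cache_bytes(num_layers, context_len, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes moved if the full KV cache crosses devices (stage-level split).

    The leading factor of 2 accounts for storing both keys and values
    in every layer.
    """
    return 2 * num_layers * context_len * num_kv_heads * head_dim * bytes_per_elem


def embedding_bytes(num_visual_tokens, hidden_dim, bytes_per_elem=2):
    """Bytes moved if only vision embeddings cross devices (modality-boundary split)."""
    return num_visual_tokens * hidden_dim * bytes_per_elem


# Hypothetical MHA model in FP16 (2 bytes per element).
kv = kv_cache_bytes(num_layers=32, context_len=4096, num_kv_heads=32, head_dim=128)
emb = embedding_bytes(num_visual_tokens=1024, hidden_dim=4096)

print(f"KV cache transfer:  {kv / 2**30:.2f} GiB")
print(f"Embedding transfer: {emb / 2**20:.2f} MiB")
print(f"Ratio: {kv // emb}x")
```

Under these assumed dimensions the KV cache is 2 GiB and the embeddings are 8 MiB, a 256x gap. Doubling `num_layers` doubles the KV-cache term and leaves the embedding term unchanged, which is consistent with the abstract's statement that the advantage grows as models deepen.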

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Embedded Systems Design Techniques