Reducing Peak Memory Usage for Modern Multimodal Large Language Model Pipelines
Junwan Kim, Hyunkyung Bae

TL;DR
This paper reduces peak memory usage in multimodal large language models (MLLMs) by applying structure-aware KV-cache compression sequentially during inference, rather than only after all inputs have been processed.
Contribution
It proposes a sequential input-compression mechanism that controls memory growth throughout inference by exploiting inherent redundancies in vision tokens.
Findings
Substantially reduces peak memory usage during inference.
Preserves generation quality with only minimal degradation.
Makes multimodal inference with large models more practical.
Abstract
Multimodal large language models (MLLMs) have recently demonstrated strong capabilities in understanding and generating responses from diverse visual inputs, including high-resolution images and long video sequences. As these models scale to richer visual representations, inference increasingly relies on storing large numbers of vision tokens in the key-value (KV) cache, making memory consumption a central bottleneck. Existing methods address this issue by identifying redundancy in vision tokens and compressing the cache, but such compression is typically applied only after all inputs are processed, resulting in high peak memory usage during the prefill stage. In this work, we show that MLLMs exhibit inherent structural regularities and representational redundancy that can be exploited to control memory growth throughout inference. Based on this insight, we propose a sequential…
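To illustrate why the timing of compression matters for peak memory, here is a minimal Python sketch. It is not the authors' method. It is a toy token-count model in which "compression" is simply a cap on the number of cached tokens, and the function name `peak_cache_tokens`, the chunk sizes, and the budget are all hypothetical values chosen for illustration.

```python
def peak_cache_tokens(chunk_sizes, budget, sequential):
    """Toy model: track how many tokens sit in the KV cache during prefill.

    chunk_sizes: vision tokens per input chunk, e.g. per frame (hypothetical).
    budget: tokens kept after a compression step (hypothetical).
    sequential: True compresses after every chunk; False compresses once,
                after all chunks have been processed.
    Returns (peak_tokens, final_tokens).
    """
    cached = 0
    peak = 0
    for n in chunk_sizes:
        cached += n               # new vision tokens enter the cache
        peak = max(peak, cached)  # peak is reached before any compression
        if sequential:
            # Stand-in for evicting or merging redundant tokens.
            cached = min(cached, budget)
    if not sequential:
        cached = min(cached, budget)
    return peak, cached


frames = [256] * 32  # 32 frames of 256 vision tokens each (illustrative)
post_hoc = peak_cache_tokens(frames, budget=1024, sequential=False)
streaming = peak_cache_tokens(frames, budget=1024, sequential=True)
print(post_hoc)   # (8192, 1024)
print(streaming)  # (1280, 1024)
```

Under these assumptions both strategies end with the same cache size, but compressing only at the end lets the cache grow to the full input length first, while compressing after each chunk bounds the peak at roughly the budget plus one chunk.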
