TrimTokenator-LC: Towards Adaptive Visual Token Pruning for Large Multimodal Models with Long Contexts
Hao Zhang, Mengsi Lyu, Bo Huang, Yulong Ao, Yonghua Lin

TL;DR
This paper introduces an adaptive visual token pruning method for large multimodal models handling long contexts with multiple images, significantly reducing inference costs while preserving performance.
Contribution
It proposes a novel two-stage pruning approach that accounts for intra-image and inter-image redundancy, optimizing token selection in complex multimodal scenarios.
Findings
Reduces visual tokens by up to 80% in long context tasks.
Maintains model performance despite significant token reduction.
Effectively balances diversity and text alignment in token pruning.
Abstract
Large Multimodal Models (LMMs) have proven effective on various tasks. They typically encode visual inputs into sequences of tokens, which are then concatenated with textual tokens and jointly processed by the language model. However, the growing number of visual tokens greatly increases inference cost. Visual token pruning has emerged as a promising solution. However, existing methods often overlook scenarios involving long context inputs with multiple images. In this paper, we analyze the challenges of visual token pruning in long context, multi-image settings and introduce an adaptive pruning method tailored for such scenarios. We decompose redundancy into intra-image and inter-image components and quantify them through intra-image diversity and inter-image variation, which jointly guide dynamic budget allocation. Our approach consists of two stages. The intra-image…
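The abstract's idea of letting intra-image diversity and inter-image variation jointly guide a per-image token budget can be illustrated with a small sketch. This is not the paper's algorithm: the truncated abstract does not give the actual definitions, so the cosine-distance proxies, the additive score, and every name below (`intra_image_diversity`, `inter_image_variation`, `allocate_budgets`, `keep_ratio`) are assumptions made purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def mean_vector(tokens):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def intra_image_diversity(tokens):
    """Assumed proxy: mean pairwise cosine distance among one image's tokens."""
    n = len(tokens)
    if n < 2:
        return 0.0
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += 1.0 - cosine(tokens[i], tokens[j])
            pairs += 1
    return total / pairs


def inter_image_variation(images):
    """Assumed proxy: per image, the mean cosine distance between its mean
    token and the mean tokens of all other images."""
    means = [mean_vector(t) for t in images]
    result = []
    for i, m in enumerate(means):
        dists = [1.0 - cosine(m, o) for j, o in enumerate(means) if j != i]
        result.append(sum(dists) / len(dists) if dists else 0.0)
    return result


def allocate_budgets(images, keep_ratio=0.2, eps=1e-6):
    """Split a global token budget across images in proportion to
    (diversity + variation). The additive combination is an assumption.
    Because of rounding and clamping, the budgets may not sum exactly
    to the global budget."""
    total_tokens = sum(len(t) for t in images)
    total_budget = max(1, round(keep_ratio * total_tokens))
    variation = inter_image_variation(images)
    scores = [intra_image_diversity(t) + v + eps
              for t, v in zip(images, variation)]
    norm = sum(scores)
    return [min(len(t), max(1, round(total_budget * s / norm)))
            for t, s in zip(images, scores)]


# Toy example: image A is fully redundant, image B has varied tokens.
image_a = [[1.0, 0.0]] * 4
image_b = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
budgets = allocate_budgets([image_a, image_b], keep_ratio=0.5)
```

In this toy case the redundant image receives the smaller share of the budget. The default `keep_ratio=0.2` corresponds to removing 80% of visual tokens, and a real implementation would operate on vision-encoder features with batched tensor operations rather than Python lists.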
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
