TL;DR
ReDiPrune is a training-free token pruning method for multimodal LLMs that selects relevant and diverse visual tokens before the projection layer, improving efficiency without retraining.
Contribution
It introduces a novel, plug-and-play token pruning technique that operates before the vision-language projector, enhancing accuracy-efficiency trade-offs in multimodal models.
Findings
Retaining 15% of tokens improves accuracy by 2.0% on EgoSchema.
ReDiPrune reduces computation, measured in TFLOPs, by more than 6×.
It outperforms post-projection pruning methods across multiple benchmarks.
Abstract
Recent multimodal large language models are computationally expensive because Transformers must process a large number of visual tokens. We present ReDiPrune, a training-free token pruning method applied before the vision-language projector, where visual features remain rich and discriminative. Unlike post-projection pruning methods that operate on compressed representations, ReDiPrune selects informative tokens directly from vision encoder outputs, preserving fine-grained spatial and semantic cues. Each token is scored by a lightweight rule that jointly considers text-conditioned relevance and max-min diversity, ensuring the selected tokens are both query-relevant and non-redundant. ReDiPrune is fully plug-and-play, requiring no retraining or architectural modifications, and can be seamlessly inserted between the encoder and projector. Across four video and five image benchmarks, it…
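The abstract does not spell out the scoring rule, so the following is only a minimal sketch of how text-conditioned relevance and max-min diversity could be combined in a greedy selection. It is not the authors' implementation. The function name `select_tokens`, the weight `alpha`, the use of cosine similarity, and the assumption that the text query embedding already lives in the same feature space as the vision encoder outputs are all hypothetical choices made for illustration.

```python
import numpy as np

def select_tokens(visual, query, keep_ratio=0.15, alpha=0.5):
    """Greedy selection mixing query relevance with max-min diversity.

    visual: (N, D) array of vision-encoder token features.
    query:  (D,) text embedding, ASSUMED to share the visual feature space.
    alpha:  hypothetical weight between relevance and diversity.
    Returns the sorted indices of the kept tokens.
    """
    n = visual.shape[0]
    k = max(1, int(round(keep_ratio * n)))

    # L2-normalize so that dot products are cosine similarities.
    v = visual / (np.linalg.norm(visual, axis=1, keepdims=True) + 1e-8)
    q = query / (np.linalg.norm(query) + 1e-8)
    relevance = v @ q  # shape (N,)

    # Seed with the single most query-relevant token.
    selected = [int(np.argmax(relevance))]
    # min_dist[i]: cosine distance from token i to its nearest selected token.
    min_dist = 1.0 - v @ v[selected[0]]

    while len(selected) < k:
        # Favor tokens that are relevant AND far from everything kept so far.
        score = alpha * relevance + (1.0 - alpha) * min_dist
        score[selected] = -np.inf  # never pick the same token twice
        nxt = int(np.argmax(score))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, 1.0 - v @ v[nxt])

    return np.sort(np.array(selected))
```

In a pipeline of the kind the abstract describes, a step like this would sit between the vision encoder and the projector, so that only `visual[select_tokens(visual, query)]` is projected and passed to the language model. The default `keep_ratio=0.15` mirrors the 15% retention figure quoted in the findings; every other setting here is an assumption.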
