TL;DR
OTPrune is a training-free visual token pruning method for multi-modal large language models (MLLMs). It uses optimal transport to align the pruned token distribution with the full one, reducing inference cost while maintaining performance.
Contribution
OTPrune frames token pruning as distribution alignment via optimal transport and derives a tractable objective that is provably monotone and submodular, giving a principled basis for stable and efficient pruning.
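For readers unfamiliar with the two ingredients named here, the standard textbook definitions are sketched below. The notation ($x_i$, $y_j$, $a$, $b$, $f$, $V$) is introduced for illustration and is not taken from the paper, and the paper's own tractable objective is not reproduced.

```latex
% Squared 2-Wasserstein distance between two discrete token distributions
% \mu = \sum_{i=1}^{N} a_i \delta_{x_i} (full set) and
% \nu = \sum_{j=1}^{k} b_j \delta_{y_j} (pruned set):
W_2^2(\mu, \nu) = \min_{P \in \Pi(a, b)} \sum_{i=1}^{N} \sum_{j=1}^{k} P_{ij} \, \lVert x_i - y_j \rVert_2^2,
\qquad
\Pi(a, b) = \{ P \in \mathbb{R}_{\ge 0}^{N \times k} : P \mathbf{1} = a,\; P^{\top} \mathbf{1} = b \}.

% A set function f : 2^V \to \mathbb{R} is monotone if
f(A) \le f(B) \quad \text{for all } A \subseteq B \subseteq V,
% and submodular (diminishing returns) if
f(A \cup \{v\}) - f(A) \ge f(B \cup \{v\}) - f(B)
\quad \text{for all } A \subseteq B \subseteq V,\; v \in V \setminus B.
```

These properties matter in practice because, for a non-negative monotone submodular function under a cardinality constraint, greedy selection is guaranteed to reach at least a $(1 - 1/e)$ fraction of the optimal value (Nemhauser, Wolsey, and Fisher, 1978).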
Findings
OTPrune achieves better efficiency-performance trade-offs than state-of-the-art pruning methods.
The method preserves local diversity and global representativeness of visual tokens.
Theoretical analysis confirms the stability and semantic faithfulness of the pruning process.
Abstract
Multi-modal large language models (MLLMs) achieve strong visual-language reasoning but suffer from high inference cost due to redundant visual tokens. Recent work explores visual token pruning to accelerate inference, but existing pruning methods overlook the underlying distributional structure of visual representations. We propose OTPrune, a training-free framework that formulates pruning as distribution alignment via optimal transport (OT). By minimizing the 2-Wasserstein distance between the full and pruned token distributions, OTPrune preserves both local diversity and global representativeness while reducing inference cost. Moreover, we derive a tractable submodular objective that enables efficient optimization, and theoretically prove its monotonicity and submodularity, providing a principled foundation for stable and efficient pruning. We further provide a comprehensive…
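To make the idea of greedy selection under a monotone submodular objective concrete, here is a minimal Python sketch. It does not implement OTPrune's objective, which the abstract does not spell out. It swaps in the classic facility-location function as a stand-in, and the function name `greedy_select`, the cosine-similarity kernel, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def greedy_select(tokens: np.ndarray, k: int) -> list[int]:
    """Greedily pick up to k token indices under a facility-location score.

    Stand-in objective (an assumption, NOT the paper's objective):
        f(S) = sum_i max_{j in S} sim(i, j)
    with sim in [0, 1], so f is non-negative, monotone, and submodular.
    """
    # Normalize rows so the dot product is cosine similarity.
    norms = np.maximum(np.linalg.norm(tokens, axis=1, keepdims=True), 1e-12)
    x = tokens / norms
    sim = (x @ x.T + 1.0) / 2.0  # map cosine similarity from [-1, 1] to [0, 1]

    n = sim.shape[0]
    best = np.zeros(n)                # best[i]: similarity of token i to its closest pick
    chosen = np.zeros(n, dtype=bool)  # mask of already selected tokens
    selected: list[int] = []

    for _ in range(min(k, n)):
        # Marginal gain of adding each candidate j to the current selection.
        gains = np.maximum(sim, best[:, None]).sum(axis=0) - best.sum()
        gains[chosen] = -np.inf       # never pick the same token twice
        j = int(np.argmax(gains))
        selected.append(j)
        chosen[j] = True
        best = np.maximum(best, sim[:, j])
    return selected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual_tokens = rng.normal(size=(576, 64))  # hypothetical token count and width
    kept = greedy_select(visual_tokens, k=64)
    print(len(kept), "tokens kept")
```

Each greedy step adds the token that most improves coverage of the full set, which is why the picks tend to spread across clusters rather than concentrate in one region. This sketch materializes the full pairwise similarity matrix and a same-sized temporary per step, so it is meant for illustration on small token counts only.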
