CATP: Contextually Adaptive Token Pruning for Efficient and Enhanced Multimodal In-Context Learning
Yanshu Li, Jianjiang Yang, Zhennan Shen, Ligong Han, Haoyan Xu, Ruixiang Tang

TL;DR
This paper introduces CATP, a training-free image token pruning method for multimodal in-context learning. It removes 77.8% of image tokens while yielding a 0.6% performance gain, lowers inference latency by 10.78% on average, and outperforms the baseline pruning methods it is compared against.
Contribution
We propose a training-free, contextually adaptive token pruning method designed specifically for multimodal in-context learning, targeting the image token redundancy and unstable performance that arise in this setting.
Findings
Reduces image tokens by 77.8% with a 0.6% performance gain
Decreases inference latency by an average of 10.78%
Outperforms all baseline pruning methods in experiments
Abstract
Modern large vision-language models (LVLMs) convert each input image into a large set of tokens that far outnumber the text tokens. Although this improves visual perception, it also introduces severe image token redundancy. Because image tokens contain sparse information, many contribute little to reasoning but greatly increase inference cost. Recent image token pruning methods address this issue by identifying important tokens and removing the rest. These methods improve efficiency with only small performance drops. However, most of them focus on single-image tasks and overlook multimodal in-context learning (ICL), where redundancy is higher and efficiency is more important. Redundant tokens weaken the advantage of multimodal ICL for rapid domain adaptation and lead to unstable performance. When existing pruning methods are applied in this setting, they cause large accuracy drops,…
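The general idea the abstract attributes to image token pruning (score the tokens, keep the important ones, drop the rest) can be illustrated with a minimal sketch. This is a generic top-k pruning step, not CATP's contextually adaptive procedure, which the truncated abstract does not describe. The function name `prune_tokens`, the per-token importance `scores`, and the figure of 576 image tokens per image are all assumptions made for illustration only.

```python
from typing import List, Sequence


def prune_tokens(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return the indices of the tokens to keep, in their original order.

    `scores` is a hypothetical per-token importance score; how such scores
    are computed is method-specific and is not modeled here.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Keep at least one token so the image is never dropped entirely.
    n_keep = max(1, round(len(scores) * keep_ratio))
    # Rank token indices by score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so the surviving tokens stay in sequence.
    return sorted(ranked[:n_keep])


# Toy example: keep the top 40% of five tokens.
kept = prune_tokens([0.1, 0.9, 0.3, 0.8, 0.05], keep_ratio=0.4)

# A 77.8% reduction corresponds to keeping 22.2% of the tokens.
# Assuming (hypothetically) 576 image tokens per image:
n_kept_per_image = len(prune_tokens([0.0] * 576, keep_ratio=1 - 0.778))
```

Under the assumed 576 tokens per image, a 77.8% reduction leaves roughly 128 tokens per image. In a multi-image in-context prompt the saving applies to every demonstration image, which is why the abstract stresses that efficiency matters more in this setting.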
Taxonomy
Topics: IoT-based Smart Home Systems · Multimodal Machine Learning Applications · Indoor and Outdoor Localization Technologies
