CATP: Contextually Adaptive Token Pruning for Efficient and Enhanced Multimodal In-Context Learning

Yanshu Li; Jianjiang Yang; Zhennan Shen; Ligong Han; Haoyan Xu; Ruixiang Tang

arXiv:2508.07871·cs.CV·December 10, 2025

TL;DR

This paper introduces CATP, a training-free token pruning method for multimodal in-context learning that significantly reduces image tokens, enhances efficiency, and improves performance across multiple models and benchmarks.

Contribution

We propose a novel, training-free, contextually adaptive token pruning method specifically designed for multimodal in-context learning, addressing redundancy and stability issues.

Findings

01 Reduces image tokens by 77.8% with a 0.6% performance gain

02 Decreases inference latency by an average of 10.78%

03 Outperforms all baseline pruning methods in experiments

Abstract

Modern large vision-language models (LVLMs) convert each input image into a large set of tokens that far outnumber the text tokens. Although this improves visual perception, it also introduces severe image token redundancy. Because image tokens contain sparse information, many contribute little to reasoning but greatly increase inference cost. Recent image token pruning methods address this issue by identifying important tokens and removing the rest. These methods improve efficiency with only small performance drops. However, most of them focus on single-image tasks and overlook multimodal in-context learning (ICL), where redundancy is higher and efficiency is more important. Redundant tokens weaken the advantage of multimodal ICL for rapid domain adaptation and lead to unstable performance. When existing pruning methods are applied in this setting, they cause large accuracy drops,…
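The generic idea the abstract attributes to prior pruning methods, scoring image tokens and keeping only the most important ones, can be illustrated with a minimal sketch. This is not CATP itself: the abstract does not specify its scoring rule, so the scores below are random placeholders, and `prune_tokens`, the 576-token image, and the token names are hypothetical choices made only for illustration. The keep ratio is taken from the reported 77.8% reduction.

```python
import math
import random


def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring tokens, preserving their original order.

    Generic top-k pruning for illustration only; not the CATP algorithm.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    n_keep = max(1, math.ceil(len(tokens) * keep_ratio))
    # Indices of the n_keep largest scores (ties keep the earlier token).
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:n_keep]
    # Restore positional order so the pruned sequence stays spatially coherent.
    kept = sorted(top)
    return [tokens[i] for i in kept], kept


# Toy setup: 576 image tokens is an assumed count, not taken from the paper.
rng = random.Random(0)
tokens = [f"img_tok_{i}" for i in range(576)]
scores = [rng.random() for _ in tokens]  # placeholder importance scores

keep_ratio = 1.0 - 0.778  # a 77.8% reduction keeps about 22.2% of tokens
kept_tokens, kept_idx = prune_tokens(tokens, scores, keep_ratio)
print(len(tokens), "->", len(kept_tokens))  # 576 -> 128
```

In a real pipeline the scores would come from the model (for example attention or feature statistics), and in the in-context setting the pruning would have to cover the images of every demonstration as well as the query, which is where the abstract says existing single-image methods struggle.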

Taxonomy

Topics: IoT-based Smart Home Systems · Multimodal Machine Learning Applications · Indoor and Outdoor Localization Technologies