CROP: Contextual Region-Oriented Visual Token Pruning

Jiawei Guo; Feifei Zhai; Pu Jian; Qianrun Wei; Yu Zhou

arXiv:2505.21233 · cs.CV · September 18, 2025

TL;DR

CROP is a novel framework that efficiently prunes visual tokens in VQA tasks by identifying relevant regions and applying adaptive compression, significantly reducing computational costs while maintaining state-of-the-art accuracy.

Contribution

The paper introduces a two-step token pruning framework combining localization and adaptive pruning strategies, advancing visual token compression in VQA models.

Findings

01. CROP outperforms existing token pruning methods in VQA tasks.

02. It achieves state-of-the-art performance with reduced computational costs.

03. The approach effectively identifies and compresses relevant image regions.

Abstract

Current VLM-based VQA methods often process entire images, leading to excessive visual tokens that include redundant information irrelevant to the posed question. This abundance of unnecessary image details creates numerous visual tokens, drastically increasing memory and computational requirements in VLMs. To address this, we propose Contextual Region-Oriented Visual Token Pruning (CROP), a novel framework to compress visual tokens through a two-step process: Localization and Pruning. Specifically, CROP first employs an efficient model to identify the contextual region relevant to the input query. Subsequently, two distinct strategies are introduced for pruning: (1) Pre-LLM Compression (PLC), which adaptively compresses different image regions with varying ratios, and (2) Inner-LLM Pruning (ILP), a training-free method that prunes tokens within early LLM layers guided by the identified…
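
To make the localize-then-prune idea concrete, here is a minimal sketch of region-dependent token selection. It is not the paper's PLC or ILP procedure: the function name `select_token_indices`, the `outside_stride` parameter, and the rule itself (keep every token inside the contextual region, keep only a coarse subgrid outside it) are assumptions chosen for illustration. The region box is assumed to come from some separate localization step and is given in patch coordinates.

```python
from typing import List, Tuple

# (row0, col0, row1, col1) in patch units, end-exclusive. Hypothetical format.
Box = Tuple[int, int, int, int]


def select_token_indices(
    grid_h: int, grid_w: int, region: Box, outside_stride: int = 2
) -> List[int]:
    """Return flat (row-major) indices of the visual tokens to keep.

    Illustrative rule, assumed for this sketch only:
    - every token inside `region` is kept;
    - outside it, a token is kept only if its row and column are both
      multiples of `outside_stride`.
    """
    if outside_stride < 1:
        raise ValueError("outside_stride must be >= 1")
    r0, c0, r1, c1 = region
    kept = []
    for r in range(grid_h):
        for c in range(grid_w):
            inside = r0 <= r < r1 and c0 <= c < c1
            on_coarse_grid = r % outside_stride == 0 and c % outside_stride == 0
            if inside or on_coarse_grid:
                kept.append(r * grid_w + c)
    return kept


if __name__ == "__main__":
    # Example: a 24x24 patch grid with a 6x6 contextual region.
    kept = select_token_indices(24, 24, region=(6, 6, 12, 12), outside_stride=2)
    print(f"kept {len(kept)} of {24 * 24} tokens")
```

Raising `outside_stride` compresses the background more aggressively while leaving the contextual region at full resolution, which is the trade-off the abstract describes as compressing different regions with varying ratios.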

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization · Interactive and Immersive Displays