Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models
Yulin Zhao, Yun Wang, Dehua Zheng, Borui Jiang, Zheng Zhang

TL;DR
This paper introduces SPpruner, a subject-centric progressive visual token reduction method for vision-language models that improves efficiency while maintaining accuracy by mimicking human visual focus and context understanding.
Contribution
The paper proposes a novel focus-then-context paradigm with explicit saliency modeling and contextual aggregation, enhancing token reduction effectiveness in vision-language models.
Findings
Achieves up to 2.53x speedup with only 22.2% of tokens retained in Qwen2.5-VL.
Reduces FLOPs by 67% on LLaVA with minimal accuracy loss.
Outperforms state-of-the-art token reduction methods across multiple benchmarks.
Abstract
Vision-Language Models (VLMs) face a bottleneck of prohibitive computational costs arising from massive visual token sequences during inference. Existing vision token reduction methods alleviate this burden, but they tend to preserve only the isolated visual subject strictly aligned with the user's query, and thus fail to fully explore salient subjects and their contextual relationships. In this paper, we propose SPpruner, a subject-centric progressive reduction paradigm that emulates the Focus-then-Context mechanism of the human visual perception system. Specifically, we first construct a focus identification module to explicitly model the interplay between visual saliency and semantic relevance. This module recovers the full spectrum of visual subjects, ensuring a high-fidelity representation of the visual input. Subsequently, a context-aware structural scanning…
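To make the focus step concrete, here is a minimal sketch of ranking visual tokens by a blend of saliency and query relevance and keeping the top fraction. It is not the paper's implementation: the function name select_focus_tokens, the linear blend weighted by alpha, the use of cosine similarity as the relevance score, and the assumption that saliency scores arrive precomputed in [0, 1] are all illustrative assumptions. Only the default keep ratio of 22.2% comes from the findings above. The context stage is not sketched because the abstract is truncated before describing it.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_focus_tokens(tokens, saliency, query, keep_ratio=0.222, alpha=0.5):
    """Return the sorted indices of the visual tokens to keep.

    Hypothetical scoring rule (an assumption, not taken from the paper):
        score = alpha * saliency + (1 - alpha) * cosine(token, query)
    """
    if len(tokens) != len(saliency):
        raise ValueError("tokens and saliency must have the same length")
    # Keep at least one token, otherwise the requested fraction.
    n_keep = max(1, round(keep_ratio * len(tokens)))
    scores = [
        alpha * s + (1 - alpha) * cosine(t, query)
        for t, s in zip(tokens, saliency)
    ]
    # Rank token indices from highest to lowest blended score.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # Restore the original order so positional structure is preserved.
    return sorted(ranked[:n_keep])


if __name__ == "__main__":
    toy_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    toy_saliency = [0.9, 0.1, 0.5, 0.8]
    toy_query = [1.0, 0.0]
    print(select_focus_tokens(toy_tokens, toy_saliency, toy_query, keep_ratio=0.5))
```

In the toy call, the fourth token has high saliency but points away from the query, so the blended score drops it. This shows why saliency and relevance are scored together rather than by either one alone.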
