Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs
Kibum Kim, Jiwan Kim, Kyle Min, Yueqi Wang, Jinyoung Moon, Julian McAuley, Chanyoung Park

TL;DR
This paper introduces Sink-Token-aware Pruning (SToP), a method to improve fine-grained video understanding in efficient Video LLMs by identifying and suppressing sink tokens that impair detailed visual reasoning.
Contribution
The paper reveals sink tokens as a key obstacle in fine-grained video understanding and proposes SToP, a simple plug-and-play method that enhances existing pruning techniques.
Findings
SToP significantly improves the performance of existing pruning methods on fine-grained tasks.
Applying SToP allows pruning up to 90% of tokens with minimal accuracy loss.
The gains hold across a range of benchmarks, including hallucination evaluation and reasoning tasks.
Abstract
Video Large Language Models (Video LLMs) incur high inference latency due to a large number of visual tokens provided to LLMs. To address this, training-free visual token pruning has emerged as a solution to reduce computational costs; however, existing methods are primarily validated on Multiple-Choice Question Answering (MCQA) benchmarks, where coarse-grained cues often suffice. In this work, we reveal that these methods suffer a sharp performance collapse on fine-grained understanding tasks requiring precise visual grounding, such as hallucination evaluation. To explore this gap, we conduct a systematic analysis and identify sink tokens--semantically uninformative tokens that attract excessive attention--as a key obstacle to fine-grained video understanding. When these sink tokens survive pruning, they distort the model's visual evidence and hinder fine-grained understanding.…
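To make the idea concrete, here is a minimal sketch of how a sink-aware step could be placed in front of an ordinary top-k token pruner. The abstract does not say how SToP detects or suppresses sink tokens, so everything here is an assumption for illustration: the median-ratio threshold, the function names flag_sink_tokens and sink_aware_prune, and the toy attention values are all hypothetical and are not taken from the paper.

```python
from statistics import median


def flag_sink_tokens(attn_received, ratio=10.0):
    """Flag tokens whose received attention exceeds `ratio` times the median.

    Hypothetical heuristic for "attracts excessive attention"; not the paper's rule.
    """
    med = median(attn_received)
    return [a > ratio * med for a in attn_received]


def sink_aware_prune(scores, attn_received, keep, ratio=10.0):
    """Return sorted indices of the `keep` tokens chosen by a base pruner's scores,
    after demoting flagged sink tokens so they cannot survive pruning.
    """
    is_sink = flag_sink_tokens(attn_received, ratio)
    # Demote sinks to the lowest possible score (an assumed form of suppression).
    adjusted = [float("-inf") if sink else s for s, sink in zip(scores, is_sink)]
    order = sorted(range(len(scores)), key=lambda i: adjusted[i], reverse=True)
    return sorted(order[:keep])


# Toy example: 8 visual tokens; token 2 soaks up most of the attention.
attn = [0.02, 0.03, 0.60, 0.04, 0.05, 0.03, 0.20, 0.03]
scores = attn  # assume the base pruner ranks tokens by attention received

plain = sink_aware_prune(scores, attn, keep=3, ratio=float("inf"))  # no sink filtering
aware = sink_aware_prune(scores, attn, keep=3)
print(plain)  # [2, 4, 6]
print(aware)  # [3, 4, 6]
```

In this toy case the plain attention-based pruner spends one of its three slots on the high-attention token 2, while the sink-aware variant gives that slot to token 3 instead. Whether a hard demotion like this matches SToP's actual suppression mechanism cannot be determined from the abstract.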
