Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs

Kibum Kim; Jiwan Kim; Kyle Min; Yueqi Wang; Jinyoung Moon; Julian McAuley; Chanyoung Park

arXiv:2604.20937 · cs.LG · April 24, 2026

TL;DR

This paper introduces Sink-Token-aware Pruning (SToP), a method to improve fine-grained video understanding in efficient Video LLMs by identifying and suppressing sink tokens that impair detailed visual reasoning.

Contribution

The paper reveals sink tokens as a key obstacle in fine-grained video understanding and proposes SToP, a simple plug-and-play method that enhances existing pruning techniques.

Findings

01. SToP significantly improves performance on fine-grained tasks.

02. With SToP applied, up to 90% of visual tokens can be pruned with minimal accuracy loss.

03. The method improves results across a range of benchmarks, including hallucination and reasoning tasks.

Abstract

Video Large Language Models (Video LLMs) incur high inference latency due to a large number of visual tokens provided to LLMs. To address this, training-free visual token pruning has emerged as a solution to reduce computational costs; however, existing methods are primarily validated on Multiple-Choice Question Answering (MCQA) benchmarks, where coarse-grained cues often suffice. In this work, we reveal that these methods suffer a sharp performance collapse on fine-grained understanding tasks requiring precise visual grounding, such as hallucination evaluation. To explore this gap, we conduct a systematic analysis and identify sink tokens--semantically uninformative tokens that attract excessive attention--as a key obstacle to fine-grained video understanding. When these sink tokens survive pruning, they distort the model's visual evidence and hinder fine-grained understanding.…
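
To make the sink-token problem concrete, here is a minimal sketch of attention-based token pruning with and without sink suppression. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not say how SToP detects or suppresses sink tokens, so the z-score outlier test, the function names `flag_sinks` and `prune`, the `z_thresh` parameter, and the toy attention scores are all hypothetical.

```python
from statistics import mean, pstdev

def flag_sinks(scores, z_thresh=3.0):
    """Flag tokens whose received attention is an extreme outlier.

    Assumption: a z-score test stands in for sink detection here;
    the abstract does not specify the actual criterion.
    """
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [False] * len(scores)
    return [(s - mu) / sigma > z_thresh for s in scores]

def prune(scores, keep_ratio, suppress_sinks, z_thresh=3.0):
    """Return the sorted indices of the tokens kept after pruning.

    Tokens are ranked by attention score. When suppress_sinks is True,
    flagged tokens are pushed to the end of the ranking, so they are
    kept only if nothing else is left to fill the budget.
    """
    n_keep = max(1, round(len(scores) * keep_ratio))
    if suppress_sinks:
        sinks = flag_sinks(scores, z_thresh)
    else:
        sinks = [False] * len(scores)
    # False sorts before True, so non-sink tokens come first,
    # then higher scores first within each group.
    ranked = sorted(range(len(scores)), key=lambda i: (sinks[i], -scores[i]))
    return sorted(ranked[:n_keep])

# Toy data: token 0 receives far more attention than the other 19 tokens.
scores = [0.60] + [0.01 + 0.001 * i for i in range(19)]

plain = prune(scores, keep_ratio=0.1, suppress_sinks=False)
aware = prune(scores, keep_ratio=0.1, suppress_sinks=True)
print("plain top-k keeps:", plain)
print("sink-aware keeps: ", aware)
```

With a 10% keep ratio (that is, 90% of tokens pruned), plain top-k selection spends one of its two slots on the high-attention token 0, while the sink-aware variant gives both slots to other tokens. Whether a high-attention token is truly uninformative cannot be decided from the scores alone; the outlier test is only a stand-in for whatever criterion the method uses.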
