Sharp Eyes and Memory for VideoLLMs: Information-Aware Visual Token Pruning for Efficient and Reliable VideoLLM Reasoning

Jialong Qin; Xin Zou; Di Lu; Yibo Yan; Xuming Hu

arXiv:2511.08003 · cs.CV · December 5, 2025

TL;DR

SharpV introduces an adaptive, information-aware visual token pruning method for VideoLLMs that reduces computational complexity and can improve performance by selectively removing redundant visual information.

Contribution

The paper presents SharpV, a two-stage pruning framework that covers both visual tokens and the KV cache. It adjusts pruning ratios dynamically based on spatial-temporal information and does not rely on attention scores, which improves efficiency and compatibility.
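To make the idea of information-dependent pruning ratios concrete, here is a minimal sketch. It is not the authors' algorithm: the novelty measure (one minus cosine similarity to the previous frame), the linear mapping to a keep ratio, and the names `frame_keep_ratios`, `min_keep`, and `max_keep` are all assumptions chosen for illustration. The only property it shares with the description above is that the ratio varies per frame and needs no attention scores.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def frame_keep_ratios(frame_feats, min_keep=0.1, max_keep=1.0):
    """Hypothetical rule: frames that differ more from their predecessor keep more tokens.

    frame_feats: one pooled feature vector per frame (an assumed input format).
    Returns one keep ratio in [min_keep, max_keep] per frame.
    """
    if not frame_feats:
        return []
    ratios = [max_keep]  # the first frame has no predecessor, so keep it in full
    for prev, cur in zip(frame_feats, frame_feats[1:]):
        # Novelty is clamped to [0, 1]; negative cosine counts as fully novel.
        novelty = min(max(1.0 - cosine(prev, cur), 0.0), 1.0)
        ratios.append(min_keep + (max_keep - min_keep) * novelty)
    return ratios


# A static second frame is pruned heavily; a changed third frame is kept.
ratios = frame_keep_ratios([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

In this toy input the second frame repeats the first and falls to the minimum ratio, while the third frame is orthogonal to the second and keeps the maximum. A real system would also need a rule for choosing which tokens to drop within each frame, which this sketch leaves out.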

Findings

01 SharpV outperforms existing methods on multiple benchmarks.

02 It achieves hierarchical cache pruning guided by visual information degradation.

03 The method maintains or improves performance with reduced computational cost.

Abstract

Current Video Large Language Models (VideoLLMs) suffer from quadratic computational complexity and key-value (KV) cache scaling because they process an excessive number of redundant visual tokens. To address this problem, we propose SharpV, a minimalist and efficient method for adaptive pruning of visual tokens and the KV cache. Unlike most uniform compression approaches, SharpV dynamically adjusts pruning ratios based on spatial-temporal information. Remarkably, this adaptive mechanism occasionally achieves performance gains over dense models, offering a novel paradigm for adaptive pruning. During the KV cache pruning stage, based on observations of visual information degradation, SharpV prunes degraded visual features in a self-calibrating manner, guided by their similarity to the original visual features. In this way, SharpV achieves hierarchical cache pruning from the perspective of…
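The KV cache stage described in the abstract can be pictured with a small sketch. This is an illustration under stated assumptions, not the paper's procedure: it treats a visual token as "degraded" when the cosine similarity between its current-layer feature and its original visual feature falls below a fixed cutoff. The function name `surviving_token_indices` and the `threshold` value are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def surviving_token_indices(original_feats, layer_feats, threshold=0.5):
    """Return indices of visual tokens whose layer features still resemble the originals.

    original_feats: per-token features as first fed to the model (assumed format).
    layer_feats:    the same tokens' features at the layer being pruned.
    threshold:      hypothetical similarity cutoff; tokens below it are dropped
                    from that layer's cache.
    """
    return [
        i
        for i, (orig, cur) in enumerate(zip(original_feats, layer_feats))
        if cosine(orig, cur) >= threshold
    ]


original = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
at_layer = [[1.0, 0.1], [1.0, 0.0], [1.0, 1.0]]
keep = surviving_token_indices(original, at_layer)  # token 1 has drifted away
```

Applying such a check independently at each layer would give every layer its own set of retained tokens, which is one way to read "hierarchical" cache pruning. How SharpV actually calibrates the cutoff is not stated in the text above.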

Taxonomy

Topics: Multimodal Machine Learning Applications · Ferroelectric and Negative Capacitance Devices · Advanced Neural Network Applications