Contribution-aware Token Compression for Efficient Video Understanding via Reinforcement Learning

Yinchao Ma; Qiang Zhou; Zhibin Wang; Xianing Chen; Hanqing Yang; Jun Song; Bo Zheng

arXiv:2602.01649·cs.CV·March 3, 2026

TL;DR

This paper introduces CaCoVID, a reinforcement learning-based method for selecting video tokens that contribute most to accurate understanding, significantly reducing computational costs while maintaining performance.

Contribution

The paper presents a novel reinforcement learning framework with a combinatorial policy optimization algorithm for contribution-aware video token compression.

Findings

01 Effective token selection improves video understanding accuracy.

02 The method reduces computational overhead in video models.

03 Online sampling accelerates policy convergence.

Abstract

Video large language models have demonstrated remarkable capabilities in video understanding tasks. However, the redundancy of video tokens introduces significant computational overhead during inference, limiting their practical deployment. Many compression algorithms prioritize retaining the features with the highest attention scores in order to minimize perturbations in attention computations. However, the correlation between a token's attention score and its actual contribution to correct answers remains ambiguous. To address this limitation, we propose a novel Contribution-aware token Compression algorithm for VIDeo understanding (CaCoVID) that explicitly optimizes the token selection policy based on the contribution of tokens to correct predictions. First, we introduce a reinforcement learning-based framework that optimizes a policy…
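The attention-score baseline that the abstract contrasts against can be sketched roughly as follows. This is a minimal illustration, not the CaCoVID method: the function name `keep_top_k_tokens` and the `keep_ratio` parameter are hypothetical, and the scores are made-up numbers. A contribution-aware policy would, under this reading, supply a different score per token while the selection step could stay similar.

```python
from typing import List, Sequence


def keep_top_k_tokens(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return indices of the highest-scoring tokens, in their original order.

    Hypothetical helper: `scores` could be attention scores (the baseline
    described in the abstract) or scores from a learned selection policy.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Always keep at least one token.
    k = max(1, round(len(scores) * keep_ratio))
    # Rank token indices by score, highest first (ties keep the earlier token).
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore the original order so the retained sequence stays coherent.
    return sorted(ranked[:k])


# Illustrative, made-up scores for eight video tokens.
example_scores = [0.05, 0.40, 0.10, 0.30, 0.02, 0.08, 0.03, 0.02]
kept = keep_top_k_tokens(example_scores, keep_ratio=0.25)
print(kept)
```

With a keep ratio of 0.25, only two of the eight tokens are passed on to the language model, which is where the inference savings would come from.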

Code & Models

Models

moraleyc/CaCoVID_LLaVA-OneVision (Hugging Face model, 8 downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning