ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding

Xiao Wang; Qingyi Si; Jianlong Wu; Shiyu Zhu; Li Cao; and Liqiang Nie

arXiv:2412.20504·cs.CV·March 25, 2025

TL;DR

ReTaKe is a training-free method that reduces both temporal and knowledge redundancy in long videos, enabling VideoLLMs to process much longer videos efficiently and accurately.

Contribution

ReTaKe introduces two novel modules, DPSelect and PivotKV, to jointly compress visual and knowledge redundancy without additional training.

Findings

1. Enables VideoLLMs to process video inputs 8 times longer, measured in frames.
2. Outperforms similar-sized models by 3-5% on key benchmarks.
3. Reduces decoding latency by approximately 20%.

Abstract

Video Large Language Models (VideoLLMs) have made significant strides in video understanding but struggle with long videos due to the limitations of their backbone LLMs. Existing solutions rely on length extrapolation, which is memory-constrained, or visual token compression, which primarily leverages low-level temporal redundancy while overlooking the more effective high-level knowledge redundancy. To address this, we propose ReTaKe, a training-free method with two novel modules, DPSelect and PivotKV, that jointly reduce temporal visual redundancy and knowledge redundancy for video compression. To align with human temporal perception, DPSelect identifies keyframes based on inter-frame distance peaks. To leverage LLMs' learned prior knowledge, PivotKV marks the keyframes as pivots and compresses non-pivot frames by pruning low-attention tokens in their KV cache.…
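
The two ideas named in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the function names (`frame_distances`, `select_keyframes`, `prune_non_pivot_tokens`), the use of cosine distance, the strict interior-peak rule, the choice to always keep the first frame, and the `keep_ratio` parameter are all assumptions made for illustration. The official code is in the repository listed under Code & Models.

```python
import math


def frame_distances(features):
    """Cosine distance between each pair of consecutive frame feature vectors.

    Assumption: cosine distance stands in for the paper's inter-frame distance.
    """
    dists = []
    for prev, cur in zip(features, features[1:]):
        dot = sum(a * b for a, b in zip(prev, cur))
        norm = math.sqrt(sum(a * a for a in prev)) * math.sqrt(sum(b * b for b in cur))
        dists.append(1.0 - dot / norm if norm > 0 else 0.0)
    return dists


def select_keyframes(features):
    """Pick keyframes at peaks of inter-frame distance (DPSelect-style idea).

    dists[i] is the distance between frame i and frame i + 1, so a strict
    interior local maximum at i marks frame i + 1 as a keyframe.
    Assumption: frame 0 is always kept as a keyframe.
    """
    dists = frame_distances(features)
    keyframes = [0]
    for i in range(1, len(dists) - 1):
        if dists[i] > dists[i - 1] and dists[i] > dists[i + 1]:
            keyframes.append(i + 1)
    return keyframes


def prune_non_pivot_tokens(frame_of_token, attention_scores, pivots, keep_ratio=0.5):
    """Choose which cached tokens to keep (PivotKV-style idea).

    Tokens of pivot (keyframe) frames are all kept. Among the remaining
    tokens, only the top `keep_ratio` fraction by attention score survives.
    Returns the sorted indices of the kept tokens.
    """
    pivot_set = set(pivots)
    kept = [t for t, f in enumerate(frame_of_token) if f in pivot_set]
    non_pivot = [t for t, f in enumerate(frame_of_token) if f not in pivot_set]
    n_keep = math.ceil(len(non_pivot) * keep_ratio)
    ranked = sorted(non_pivot, key=lambda t: attention_scores[t], reverse=True)
    return sorted(kept + ranked[:n_keep])


# Toy usage: five frames with 2-D features; the content changes at frame 2.
features = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
pivots = select_keyframes(features)

# Two cached tokens per frame for frames 0..3, with made-up attention scores.
frame_of_token = [0, 0, 1, 1, 2, 2, 3, 3]
attention_scores = [0.1, 0.2, 0.9, 0.05, 0.3, 0.3, 0.01, 0.6]
kept_tokens = prune_non_pivot_tokens(frame_of_token, attention_scores, pivots)
```

In this toy run the only interior distance peak falls between frames 1 and 2, so frames 0 and 2 become pivots. Their tokens are kept in full, while half of the non-pivot tokens (those with the highest attention scores) survive pruning.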

Code & Models

Repositories

sczwangxiao/video-retake (PyTorch, official)

Taxonomy

Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Multimodal Machine Learning Applications

Methods: ALIGN · Pruning · Focus