ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding
Xiao Wang, Qingyi Si, Jianlong Wu, Shiyu Zhu, Li Cao, and Liqiang Nie

TL;DR
ReTaKe is a training-free method that reduces both temporal and knowledge redundancy in long videos, enabling VideoLLMs to process much longer videos efficiently and accurately.
Contribution
ReTaKe introduces two novel modules, DPSelect and PivotKV, to jointly compress visual and knowledge redundancy without additional training.
Findings
Enables processing of video frame sequences 8 times longer.
Outperforms similar-sized models by 3-5% on key benchmarks.
Reduces decoding latency by approximately 20%.
Abstract
Video Large Language Models (VideoLLMs) have made significant strides in video understanding but struggle with long videos due to the limitations of their backbone LLMs. Existing solutions rely on length extrapolation, which is memory-constrained, or visual token compression, which primarily leverages low-level temporal redundancy while overlooking the more effective high-level knowledge redundancy. To address this, we propose ReTaKe, a training-free method with two novel modules, DPSelect and PivotKV, to jointly reduce both temporal visual redundancy and knowledge redundancy for video compression. To align with human temporal perception, DPSelect identifies keyframes based on inter-frame distance peaks. To leverage LLMs' learned prior knowledge, PivotKV marks the keyframes as pivots and compresses non-pivot frames by pruning low-attention tokens in their KV cache.…
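The two ideas named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify the distance metric, the peak criterion, or how attention scores are aggregated, so the cosine distance, the strict local-maximum test, the `min_dist` and `keep_ratio` parameters, and both function names below are assumptions chosen for illustration only.

```python
import numpy as np


def select_keyframes_by_distance_peaks(frame_feats, min_dist=1e-6):
    """Hypothetical DPSelect-style step: pick frames that follow a peak
    in the distance between consecutive frame features.

    frame_feats: array of shape (num_frames, dim), one feature per frame.
    Returns a list of frame indices treated as keyframes.
    """
    # Assumption: cosine distance between consecutive frames.
    normed = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    dist = 1.0 - np.sum(normed[1:] * normed[:-1], axis=1)  # dist[i]: frame i -> i+1

    # Assumption: a peak is a strict local maximum; sequence ends are padded with 0.
    padded = np.concatenate(([0.0], dist, [0.0]))
    is_peak = (dist > padded[:-2]) & (dist > padded[2:]) & (dist > min_dist)

    # The keyframe is the frame right after the peak distance.
    return (np.flatnonzero(is_peak) + 1).tolist()


def prune_low_attention_tokens(attn_scores, token_frame_ids, pivot_frames, keep_ratio):
    """Hypothetical PivotKV-style step: keep every token of a pivot frame and
    only the highest-attention tokens of non-pivot frames.

    attn_scores: array of shape (num_tokens,), one score per cached token.
    token_frame_ids: array of shape (num_tokens,), frame index of each token.
    Returns a boolean keep-mask over the cached tokens.
    """
    is_pivot = np.isin(token_frame_ids, pivot_frames)
    keep = is_pivot.copy()

    non_pivot_idx = np.flatnonzero(~is_pivot)
    n_keep = int(np.ceil(keep_ratio * len(non_pivot_idx)))
    if n_keep > 0:
        # Sort non-pivot tokens by attention, highest first, and keep the top ones.
        order = np.argsort(attn_scores[non_pivot_idx])[::-1][:n_keep]
        keep[non_pivot_idx[order]] = True
    return keep
```

In a real VideoLLM the keep-mask would be applied to the cached key and value tensors of each layer; that step is omitted here because it depends on the backbone model's cache layout.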
Taxonomy
Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Multimodal Machine Learning Applications
Methods: ALIGN · Pruning · Focus
