Video-XL-2: Towards Very Long-Video Understanding Through Task-Aware KV Sparsification
Minghao Qin, Xiangrui Liu, Zhengyang Liang, Yan Shu, Huaying Yuan, Juenjie Zhou, Shitao Xiao, Bo Zhao, Zheng Liu

TL;DR
Video-XL-2 introduces a task-aware key-value sparsification method that significantly improves the efficiency and performance of long-video understanding in multi-modal large language models, enabling processing of thousands of frames with reduced computational costs.
Contribution
The paper presents a novel framework combining chunk-based pre-filling and bi-level key-value decoding to enhance long-video understanding efficiency and accuracy.
Findings
Achieves state-of-the-art results on long video benchmarks.
Processes over 10,000 frames on a single GPU.
Outperforms existing lightweight models in efficiency and accuracy.
Abstract
Multi-modal large language models (MLLMs) have made significant progress in video understanding over the past few years. However, processing long video inputs remains a major challenge due to high memory and computational costs. This makes it difficult for current models to achieve both strong performance and high efficiency in long video understanding. To address this challenge, we propose Video-XL-2, a novel MLLM that delivers superior cost-effectiveness for long-video understanding based on task-aware KV sparsification. The proposed framework operates in two key steps: chunk-based pre-filling and bi-level key-value decoding. Chunk-based pre-filling divides the visual token sequence into chunks, applying full attention within each chunk and sparse attention across chunks. This significantly reduces computational and memory overhead. During decoding, bi-level key-value…
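The chunk-based pre-filling idea described in the abstract (full attention within a chunk, sparse attention across chunks) can be illustrated with a small attention-mask sketch. This is not the authors' implementation: the abstract does not say which cross-chunk tokens stay visible, so the rule used here (attention is causal, and each earlier chunk exposes only its last `n_visible` tokens) is an assumption made purely for illustration, and the names `chunked_sparse_mask`, `count_attended`, and `n_visible` are hypothetical.

```python
def chunked_sparse_mask(seq_len, chunk_size, n_visible):
    """Return mask[q][k] == True if query token q may attend to key token k.

    Assumptions (not taken from the paper): attention is causal, and each
    earlier chunk exposes only its last `n_visible` tokens to later chunks.
    """
    mask = [[False] * seq_len for _ in range(seq_len)]
    for q in range(seq_len):
        q_chunk = q // chunk_size
        for k in range(q + 1):  # causal: keys up to and including q
            k_chunk = k // chunk_size
            if k_chunk == q_chunk:
                # Full attention inside the query's own chunk.
                mask[q][k] = True
            else:
                # Sparse attention across chunks: only the tail of the
                # earlier chunk is visible (hypothetical selection rule).
                chunk_end = min((k_chunk + 1) * chunk_size, seq_len)
                mask[q][k] = k >= chunk_end - n_visible
    return mask


def count_attended(mask):
    """Count the (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)


seq_len = 8
mask = chunked_sparse_mask(seq_len=seq_len, chunk_size=4, n_visible=1)
sparse_pairs = count_attended(mask)
dense_pairs = seq_len * (seq_len + 1) // 2  # full causal attention
print(sparse_pairs, dense_pairs)
```

With fixed `chunk_size` and `n_visible`, the number of attended pairs per query grows with the number of earlier chunks rather than the number of earlier tokens, which is the kind of saving the abstract attributes to chunk-based pre-filling.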
Taxonomy
Topics: Advanced Image Processing Techniques · Video Coding and Compression Technologies · Medical Imaging Techniques and Applications
