LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, and Vikas Chandra

TL;DR
LongVU introduces a spatiotemporal adaptive compression method that reduces video tokens by leveraging cross-modal queries and inter-frame dependencies, enabling efficient processing of long videos with minimal information loss.
Contribution
The paper presents a novel adaptive compression mechanism for long videos that preserves visual details while significantly reducing token redundancy, improving video understanding performance.
Findings
Outperforms existing methods on various benchmarks.
Effectively processes hour-long videos with minimal information loss.
Scales well with smaller language models.
Abstract
Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge, constrained by the LLM's context size. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism that reduces the number of video tokens while preserving the visual details of long videos. Our idea is to leverage cross-modal queries and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we leverage DINOv2 features to remove redundant frames that exhibit high similarity. Then we utilize a text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of…
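The first step in the abstract, dropping frames whose features are highly similar, can be illustrated with a minimal sketch. It is not the authors' implementation. The toy 2-D vectors stand in for DINOv2 frame features, and the function names, the 0.9 threshold, and the choice to compare each frame against the most recently kept frame are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def drop_redundant_frames(frame_features, threshold=0.9):
    """Return the indices of frames to keep.

    A frame is kept when its similarity to the most recently kept frame
    falls below `threshold` (a hypothetical value, not from the paper).
    The first frame is always kept.
    """
    kept = []
    for index, feature in enumerate(frame_features):
        if not kept or cosine_similarity(feature, frame_features[kept[-1]]) < threshold:
            kept.append(index)
    return kept


# Toy stand-ins for per-frame features (assumed; real features would come
# from a vision encoder such as DINOv2).
toy_features = [
    [1.0, 0.0],    # frame 0: start of scene A
    [0.99, 0.05],  # frame 1: nearly identical to frame 0
    [0.0, 1.0],    # frame 2: scene change
    [0.02, 1.0],   # frame 3: nearly identical to frame 2
    [1.0, 0.0],    # frame 4: back to scene A
]
kept_indices = drop_redundant_frames(toy_features)
print(kept_indices)
```

On this toy input the near-duplicate frames 1 and 3 are dropped and frames 0, 2, and 4 are kept. Comparing against the last kept frame rather than the immediately preceding one stops slow drift from chaining many near-identical frames together, but that is a design choice of this sketch, not a claim about LongVU.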
Code & Models
- 🤗 Vision-CAIR/LongVU_Qwen2_7B (model) · 168 dl · ♡ 74
- 🤗 Vision-CAIR/LongVU_Llama3_2_3B (model) · 22 dl · ♡ 8
- 🤗 Vision-CAIR/LongVU_Llama3_2_1B (model) · 19 dl · ♡ 12
- 🤗 jadechoghari/LongVU_Qwen2_7B (model) · 9 dl · ♡ 1
- 🤗 jadechoghari/LongVU_Llama3_2_1B (model) · 1 dl
- 🤗 jadechoghari/LongVU_Llama3_2_3B (model) · 1 dl
- 🤗 tcm03/LongVidLLaMA (model) · 7 dl · ♡ 1
- 🤗 Lu9876/VideoTG_R1 (model)
Taxonomy
Topics: Video Analysis and Summarization · Advanced Data Compression Techniques · Multimodal Machine Learning Applications
