LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, Vikas Chandra

arXiv:2410.17434·cs.CV·October 24, 2024

Open Access · 1 Repo · 8 Models

TL;DR

LongVU introduces a spatiotemporal adaptive compression method that reduces video tokens by leveraging cross-modal queries and inter-frame dependencies, enabling efficient processing of long videos with minimal information loss.

Contribution

The paper presents a novel adaptive compression mechanism for long videos that preserves visual details while significantly reducing token redundancy, improving video understanding performance.

Findings

01 Outperforms existing methods on various benchmarks.

02 Effectively processes hour-long videos with minimal information loss.

03 Scales well with smaller language models.

Abstract

Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge, constrained by the LLM's context size. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism that reduces the number of video tokens while preserving the visual details of long videos. Our idea is to leverage cross-modal queries and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we leverage DINOv2 features to remove redundant frames that exhibit high similarity. Then we utilize a text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of…
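The first step described in the abstract, dropping frames whose features are highly similar, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `drop_redundant_frames`, the cosine-similarity criterion, the comparison against the most recently kept frame, and the `threshold` value are all assumptions made for illustration. The input rows stand in for per-frame feature vectors such as pooled DINOv2 embeddings.

```python
import numpy as np


def drop_redundant_frames(features, threshold=0.9):
    """Return indices of frames to keep (hypothetical helper).

    A frame is dropped when its cosine similarity to the most recently
    kept frame is at or above `threshold` (an assumed value).
    """
    features = np.asarray(features, dtype=np.float64)
    if features.shape[0] == 0:
        return []

    # Normalize each row so a dot product equals cosine similarity.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)

    kept = [0]  # Always keep the first frame.
    for i in range(1, unit.shape[0]):
        similarity = float(unit[i] @ unit[kept[-1]])
        if similarity < threshold:
            kept.append(i)
    return kept


# Toy stand-ins for per-frame features: frames 1 and 3 nearly repeat
# the frame before them, so only frames 0, 2, and 4 survive.
toy_features = [
    [1.0, 0.0],
    [0.99, 0.01],
    [0.0, 1.0],
    [0.0, 1.0],
    [1.0, 0.0],
]
kept_indices = drop_redundant_frames(toy_features, threshold=0.9)
print(kept_indices)
```

Comparing against the last kept frame, rather than the immediately preceding one, prevents a slowly drifting scene from being discarded entirely; whether LongVU makes the same choice is not stated in the abstract.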

Code & Models

Repositories

Vision-CAIR/LongVU
PyTorch · Official

Taxonomy

Topics: Video Analysis and Summarization · Advanced Data Compression Techniques · Multimodal Machine Learning Applications