Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM

Han Wang; Yuxiang Nie; Yongjie Ye; Deng GuanYu; Yanjie Wang; Shuai Li; Haiyang Yu; Jinghui Lu; Can Huang

arXiv:2412.09530·cs.CV·December 13, 2024

TL;DR

Dynamic-VLM introduces a novel visual token compression method for VideoLLMs, enabling efficient video analysis with state-of-the-art performance and better generalization across multiple video understanding tasks.

Contribution

The paper proposes a dynamic visual token compression architecture and a large-scale synthetic video dataset, improving both the efficiency and the performance of VideoLLMs relative to existing models.

Findings

01. Achieves state-of-the-art results on various video tasks.

02. Improves performance by 2.7% on VideoMME and 10.7% on MuirBench.

03. Demonstrates strong generalization capabilities.

Abstract

The application of Large Vision-Language Models (LVLMs) to analyzing images and videos is an exciting and rapidly evolving field. In recent years, we've seen significant growth in high-quality image-text datasets for fine-tuning image understanding, but there is still a lack of comparable datasets for videos. Additionally, many VideoLLMs are extensions of single-image VLMs, which may not efficiently handle the complexities of longer videos. In this study, we introduce a large-scale synthetic dataset created from proprietary models, using carefully designed prompts to tackle a wide range of questions. We also explore a dynamic visual token compression architecture that strikes a balance between computational efficiency and performance. Our proposed Dynamic-VLM achieves state-of-the-art results across various video tasks and shows impressive generalization, setting new baselines in…
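The abstract does not spell out how the compression works, so the following is only a rough sketch of one way a dynamic scheme could trade per-frame detail against video length. It is not the authors' implementation. It assumes a fixed total token budget, a square token grid per frame, and adaptive average pooling as the compressor; the names `tokens_side_for`, `adaptive_avg_pool`, and `compress_video`, and all default values, are hypothetical.

```python
from math import isqrt


def ceil_div(a, b):
    """Integer ceiling division for positive b."""
    return -(-a // b)


def tokens_side_for(num_frames, budget=8192, min_side=2, max_side=12):
    """Choose the per-frame grid side from the frame count (assumed policy).

    More frames means fewer tokens per frame. The side is clamped to
    [min_side, max_side], so very long videos can exceed the budget.
    """
    side = isqrt(max(budget // num_frames, 1))
    return max(min_side, min(max_side, side))


def adaptive_avg_pool(grid, out_side):
    """Average-pool an H x W grid of feature vectors to out_side x out_side."""
    h, w, dim = len(grid), len(grid[0]), len(grid[0][0])
    pooled = []
    for i in range(out_side):
        r0, r1 = (i * h) // out_side, ceil_div((i + 1) * h, out_side)
        row = []
        for j in range(out_side):
            c0, c1 = (j * w) // out_side, ceil_div((j + 1) * w, out_side)
            cells = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append([sum(v[k] for v in cells) / len(cells) for k in range(dim)])
        pooled.append(row)
    return pooled


def compress_video(frames, budget=8192):
    """Compress every frame to the same side and flatten to one token list."""
    side = tokens_side_for(len(frames), budget)
    tokens = []
    for grid in frames:
        for row in adaptive_avg_pool(grid, side):
            tokens.extend(row)
    return tokens


# Toy input: 4 frames, each a 4 x 4 grid of 1-dimensional features.
frame = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
video_tokens = compress_video([frame] * 4, budget=16)
```

Under these assumptions a short clip keeps a fine grid per frame while a long video falls back to a coarse one, so the sequence passed to the language model stays roughly bounded.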

Code & Models

Repositories

hon-wong/bytevideollm (PyTorch, official)

Taxonomy

Topics: Advanced Vision and Imaging · Advanced Steganography and Watermarking Techniques · Image Processing Techniques and Applications