Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models

Shimin Chen; Yitian Yuan; Shaoxiang Chen; Zequn Jie; Lin Ma

arXiv:2406.08024·cs.CV·June 13, 2024

TL;DR

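This paper introduces a cost-effective video-LVLM that builds on the visual commonalities between images and videos, compresses the visual tokens of each frame to cut computational cost, and achieves strong results with only 10% of the video data used by prior video-LVLMs. It also examines how video instruction data, particularly temporally focused data, affects performance.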

Contribution

We develop an efficient video-LVLM architecture with a weighted token sampler and accompanying training strategies, reaching strong performance with about 10% of the video data used by prior video-LVLMs.

Findings

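01. Training on 10% of the video data used by prior video-LVLMs yields results comparable to training on the full datasets.

02. The weighted token sampler compresses the visual tokens of each video frame, substantially reducing computational cost.

03. Incorporating temporally focused video instruction data improves model performance.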

Abstract

Amidst the advancements in image-based Large Vision-Language Models (image-LVLM), the transition to video-based models (video-LVLM) is hindered by the limited availability of quality video data. This paper addresses the challenge by leveraging the visual commonalities between images and videos to efficiently evolve image-LVLMs into video-LVLMs. We present a cost-effective video-LVLM that enhances model architecture, introduces innovative training strategies, and identifies the most effective types of video instruction data. Our innovative weighted token sampler significantly compresses the visual token numbers of each video frame, effectively cutting computational expenses. We also find that judiciously using just 10% of the video data, compared to prior video-LVLMs, yields impressive results during various training phases. Moreover, we delve into the influence of video instruction data…
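
The abstract does not spell out how the weighted token sampler works, so the following is only an illustrative sketch of the general idea of keeping a weighted subset of each frame's visual tokens. It is not the authors' implementation. The function names, the Efraimidis-Spirakis sampling scheme, and the frame and token counts are all assumptions chosen for illustration.

```python
import random


def weighted_token_sample(tokens, weights, k, seed=0):
    """Keep k tokens, sampled without replacement with probability
    favoring larger weights. Hypothetical helper, not the paper's method."""
    if len(tokens) != len(weights):
        raise ValueError("tokens and weights must have the same length")
    if not 0 < k <= len(tokens):
        raise ValueError("k must be between 1 and the number of tokens")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be strictly positive")

    rng = random.Random(seed)
    # Efraimidis-Spirakis weighted sampling: draw key = u ** (1 / w)
    # for each token and keep the k tokens with the largest keys.
    keys = [rng.random() ** (1.0 / w) for w in weights]
    top = sorted(range(len(tokens)), key=lambda i: keys[i], reverse=True)[:k]
    # Restore the original order so spatial ordering within the frame is kept.
    return [tokens[i] for i in sorted(top)]


def visual_token_count(num_frames, tokens_per_frame):
    """Total visual tokens passed to the language model for one video."""
    return num_frames * tokens_per_frame


# Assumed numbers for illustration only: 8 frames, 256 tokens per frame,
# compressed to 64 tokens per frame.
frame_tokens = list(range(256))
frame_weights = [1.0 + (i % 16) for i in range(256)]
kept = weighted_token_sample(frame_tokens, frame_weights, k=64)

before = visual_token_count(8, 256)
after = visual_token_count(8, len(kept))
print(before, after)
```

Under these assumed numbers, the visual sequence length drops by a factor of four, which is where the computational saving would come from, since the language model processes every visual token of every frame.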

Taxonomy

Topics: Multimodal Machine Learning Applications