VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation

Weiming Ren; Huan Yang; Jie Min; Cong Wei; Wenhu Chen

arXiv:2412.00927·cs.CV·December 3, 2024

Open Access 3 Models 2 Datasets

TL;DR

VISTA is a data-centric video augmentation framework that synthesizes long-duration and high-resolution videos to improve large multimodal models' video understanding, yielding a 3.3% average improvement on four benchmarks and a 6.5% gain on HRVideoBench.

Contribution

The paper presents VISTA, a novel spatiotemporal augmentation method and dataset that enhance long-duration and high-resolution video understanding in multimodal models.

Findings

01. 3.3% average improvement on four benchmarks

02. 6.5% performance gain on HRVideoBench

03. Effective augmentation for long and high-res videos

Abstract

Current large multimodal models (LMMs) face significant challenges in processing and comprehending long-duration or high-resolution videos, which is mainly due to the lack of high-quality datasets. To address this issue from a data-centric perspective, we propose VISTA, a simple yet effective Video Spatiotemporal Augmentation framework that synthesizes long-duration and high-resolution video instruction-following pairs from existing video-caption datasets. VISTA spatially and temporally combines videos to create new synthetic videos with extended durations and enhanced resolutions, and subsequently produces question-answer pairs pertaining to these newly synthesized videos. Based on this paradigm, we develop seven video augmentation methods and curate VISTA-400K, a video instruction-following dataset aimed at enhancing long-duration and high-resolution video understanding. Finetuning…
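The abstract describes combining existing videos temporally (to extend duration) and spatially (to increase resolution). The sketch below is only a toy illustration of those two operations, not the authors' implementation: the function names `concat_temporal` and `tile_spatial`, the nested-list video representation, and the grid layout are all assumptions made for this example, and the generation of question-answer pairs is not covered.

```python
from typing import List

Frame = List[List[int]]  # H rows, each with W grayscale pixel values
Video = List[Frame]      # T frames


def concat_temporal(clips: List[Video]) -> Video:
    """Join clips end to end, producing one longer video."""
    return [frame for clip in clips for frame in clip]


def tile_spatial(clips: List[Video], rows: int, cols: int) -> Video:
    """Place rows*cols equally sized clips on a grid, producing a larger frame size."""
    if len(clips) != rows * cols:
        raise ValueError("expected exactly rows * cols clips")
    num_frames = len(clips[0])
    height = len(clips[0][0])
    if any(len(c) != num_frames or len(c[0]) != height for c in clips):
        raise ValueError("all clips must share frame count and height")

    tiled: Video = []
    for t in range(num_frames):
        frame: Frame = []
        for r in range(rows):
            row_clips = clips[r * cols:(r + 1) * cols]
            for y in range(height):
                # Concatenate pixel row y of every clip in this grid row.
                line: List[int] = []
                for clip in row_clips:
                    line.extend(clip[t][y])
                frame.append(line)
        tiled.append(frame)
    return tiled


def make_clip(value: int, frames: int = 2, height: int = 2, width: int = 3) -> Video:
    """Hypothetical stand-in for a real clip: every pixel equals `value`."""
    return [[[value] * width for _ in range(height)] for _ in range(frames)]


clips = [make_clip(v) for v in (1, 2, 3, 4)]
long_video = concat_temporal(clips)               # 8 frames of 2x3 pixels
grid_video = tile_spatial(clips, rows=2, cols=2)  # 2 frames of 4x6 pixels
```

Here `long_video` has four times as many frames as each source clip, while `grid_video` keeps the frame count and doubles both height and width. A real pipeline would operate on decoded video tensors and would have to handle clips of differing sizes and lengths.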

Taxonomy

Topics: Advanced Vision and Imaging · Visual Attention and Saliency Detection · Image Processing Techniques and Applications