VideoVista: A Versatile Benchmark for Video Understanding and Reasoning

Yunxin Li; Xinyu Chen; Baotian Hu; Longyue Wang; Haoyuan Shi; Min Zhang

arXiv:2406.11303·cs.CV·June 18, 2024

TL;DR

VideoVista is a comprehensive benchmark designed to evaluate large multimodal models' performance across diverse video understanding and reasoning tasks, highlighting current limitations and guiding future improvements.

Contribution

The paper introduces VideoVista, a versatile video QA benchmark with 25,000 questions from 3,400 videos across 14 categories, and an automatic data construction framework utilizing GPT-4o.

Findings

01 Video-LMMs struggle with temporal location, object tracking, and anomaly detection.

02 Video-LMMs show weak logical and relational reasoning abilities.

03 Open-source Video-LMMs score 20 points lower than GPT-4o and Gemini-1.5.
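
The findings compare models per task and as a gap in points. A minimal sketch of how such numbers could be computed is shown here, assuming each question has a single gold answer label and that scores are accuracies in percent; the record keys (`task`, `answer`, `prediction`), the function names, and the toy records are all hypothetical and are not taken from the VideoVista release.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Return {task: accuracy in percent} for a list of QA records.

    Each record is a dict with hypothetical keys:
    'task', 'answer' (gold label), and 'prediction' (model label).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["task"]] += 1
        if record["prediction"] == record["answer"]:
            correct[record["task"]] += 1
    return {task: 100.0 * correct[task] / total[task] for task in total}


def point_gap(scores_a, scores_b):
    """Per-task difference scores_a - scores_b in percentage points,
    restricted to tasks present in both score dicts."""
    return {t: scores_a[t] - scores_b[t] for t in scores_a if t in scores_b}


# Made-up records that only exercise the functions; not benchmark data.
toy_model_a = [
    {"task": "anomaly detection", "answer": "A", "prediction": "A"},
    {"task": "anomaly detection", "answer": "B", "prediction": "B"},
    {"task": "logical reasoning", "answer": "C", "prediction": "C"},
    {"task": "logical reasoning", "answer": "D", "prediction": "A"},
]
toy_model_b = [
    {"task": "anomaly detection", "answer": "A", "prediction": "A"},
    {"task": "anomaly detection", "answer": "B", "prediction": "C"},
    {"task": "logical reasoning", "answer": "C", "prediction": "C"},
    {"task": "logical reasoning", "answer": "D", "prediction": "B"},
]

scores_a = accuracy_by_task(toy_model_a)
scores_b = accuracy_by_task(toy_model_b)
gaps = point_gap(scores_a, scores_b)
```

Reporting the gap per task rather than as one overall number keeps weak spots such as anomaly detection visible even when the aggregate scores look close.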

Abstract

Despite significant breakthroughs in video analysis driven by the rapid development of large multimodal models (LMMs), there remains a lack of a versatile evaluation benchmark to comprehensively assess these models' performance in video understanding and reasoning. To address this, we present VideoVista, a video QA benchmark that integrates challenges across diverse content categories, durations, and abilities. Specifically, VideoVista comprises 25,000 questions derived from 3,400 videos spanning 14 categories (e.g., Howto, Film, and Entertainment) with durations ranging from a few seconds to over 10 minutes. Besides, it encompasses 19 types of understanding tasks (e.g., anomaly detection, interaction understanding) and 8 reasoning tasks (e.g., logical reasoning, causal reasoning). To achieve this, we present an automatic data construction framework, leveraging powerful GPT-4o alongside…

Code & Models

Repositories

hitsz-tmg/videovista

Taxonomy

Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Multimodal Machine Learning Applications