VideoVista: A Versatile Benchmark for Video Understanding and Reasoning
Yunxin Li, Xinyu Chen, Baotian Hu, Longyue Wang, Haoyuan Shi, Min Zhang

TL;DR
VideoVista is a comprehensive benchmark designed to evaluate large multimodal models' performance across diverse video understanding and reasoning tasks, highlighting current limitations and guiding future improvements.
Contribution
The paper introduces VideoVista, a versatile video QA benchmark with 25,000 questions from 3,400 videos across 14 categories, and an automatic data construction framework utilizing GPT-4o.
Findings
Video-LMMs struggle with temporal location, object tracking, and anomaly detection.
Video-LMMs have weaker logical and relation reasoning abilities.
Open-source Video-LMMs score 20 points lower than GPT-4o and Gemini-1.5.
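The findings above compare models by accuracy, overall and per task. As a minimal sketch of how such a comparison could be computed for multiple-choice video QA, the Python below scores two sets of predictions against gold answers. The record layout (`task`, `answer`, `prediction`), the function names, and the toy data are all assumptions made for illustration; they are not the benchmark's actual file format, evaluation script, or results.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Return {task: accuracy in percent} for multiple-choice QA records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["task"]] += 1
        if r["prediction"] == r["answer"]:
            correct[r["task"]] += 1
    return {task: 100.0 * correct[task] / total[task] for task in total}


def overall_accuracy(records):
    """Return accuracy in percent over all records (0.0 if empty)."""
    if not records:
        return 0.0
    hits = sum(r["prediction"] == r["answer"] for r in records)
    return 100.0 * hits / len(records)


def make_records(gold, predictions):
    """Pair hypothetical gold (task, answer) tuples with model predictions."""
    return [
        {"task": task, "answer": answer, "prediction": pred}
        for (task, answer), pred in zip(gold, predictions)
    ]


# Toy data, invented for this sketch only (not taken from the benchmark).
gold = [
    ("temporal location", "B"),
    ("temporal location", "A"),
    ("logical reasoning", "C"),
    ("logical reasoning", "D"),
]
model_a = make_records(gold, ["B", "C", "C", "D"])
model_b = make_records(gold, ["A", "C", "C", "D"])

# Gap in percentage points between the two hypothetical models.
gap = overall_accuracy(model_a) - overall_accuracy(model_b)
per_task_a = accuracy_by_task(model_a)
per_task_b = accuracy_by_task(model_b)
```

Reporting per-task accuracy alongside the overall score is what makes weaknesses on specific tasks, such as temporal location, visible; a single aggregate number would hide them.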
Abstract
Despite significant breakthroughs in video analysis driven by the rapid development of large multimodal models (LMMs), there remains a lack of a versatile evaluation benchmark to comprehensively assess these models' performance in video understanding and reasoning. To address this, we present VideoVista, a video QA benchmark that integrates challenges across diverse content categories, durations, and abilities. Specifically, VideoVista comprises 25,000 questions derived from 3,400 videos spanning 14 categories (e.g., Howto, Film, and Entertainment) with durations ranging from a few seconds to over 10 minutes. Besides, it encompasses 19 types of understanding tasks (e.g., anomaly detection, interaction understanding) and 8 reasoning tasks (e.g., logical reasoning, causal reasoning). To achieve this, we present an automatic data construction framework, leveraging powerful GPT-4o alongside…
Taxonomy
Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Multimodal Machine Learning Applications
