Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models
Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, Li Yuan

TL;DR
Video-Bench provides a comprehensive evaluation framework and toolkit for assessing Video-LLMs across understanding, reasoning, and decision-making tasks, revealing current models' limitations in human-like video comprehension.
Contribution
The paper introduces Video-Bench, a new benchmark and toolkit for systematically evaluating Video-LLMs' capabilities across multiple levels and tasks.
Findings
Current Video-LLMs underperform in human-like comprehension.
Video-Bench covers diverse tasks for comprehensive evaluation.
Toolkit automates metric calculation and scoring.
Abstract
Video-based large language models (Video-LLMs) have been recently introduced, targeting both fundamental improvements in perception and comprehension, and a diverse range of user inquiries. In pursuit of the ultimate goal of achieving artificial general intelligence, a truly intelligent Video-LLM model should not only see and understand the surroundings, but also possess human-level commonsense, and make well-informed decisions for the users. To guide the development of such a model, the establishment of a robust and comprehensive evaluation system becomes crucial. To this end, this paper proposes Video-Bench, a new comprehensive benchmark along with a toolkit specifically designed for evaluating Video-LLMs. The benchmark comprises 10 meticulously crafted tasks, evaluating the capabilities of Video-LLMs across three distinct levels: Video-exclusive Understanding, Prior…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
