LVBench: An Extreme Long Video Understanding Benchmark

Weihan Wang; Zehai He; Wenyi Hong; Yean Cheng; Xiaohan Zhang; Ji Qi; Xiaotao Gu; Shiyu Huang; Bin Xu; Yuxiao Dong; Ming Ding; Jie Tang

arXiv:2406.08035·cs.CV·August 12, 2025·1 cites

Open Access·1 Repo·2 Datasets·4 Reviews

TL;DR

LVBench is a new benchmark designed to evaluate and advance multimodal models' ability to understand and extract information from long videos spanning several hours, addressing a gap in current short-video focused datasets.

Contribution

The paper introduces LVBench, a comprehensive long video understanding benchmark with diverse tasks, to evaluate and promote models capable of long-term memory and extended comprehension.

Findings

01. Current models underperform on long video tasks

02. LVBench reveals the need for models with better long-term memory

03. Dataset and code are publicly available for research use

Abstract

Recent progress in multimodal large language models has markedly enhanced the understanding of short videos (typically under one minute), and several evaluation datasets have emerged accordingly. However, these advancements fall short of meeting the demands of real-world applications such as embodied intelligence for long-term decision-making, in-depth movie reviews and discussions, and live sports commentary, all of which require comprehension of long videos spanning several hours. To address this gap, we introduce LVBench, a benchmark specifically designed for long video understanding. Our dataset comprises publicly sourced videos and encompasses a diverse set of tasks aimed at long video comprehension and information extraction. LVBench is designed to challenge multimodal models to demonstrate long-term memory and extended comprehension capabilities. Our extensive evaluations reveal…

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01·Rating 5·Confidence 4

Strengths

* Several baseline methods are adopted and tested on the new benchmark.
* The quality control process is well designed to avoid imbalance of question types and language bias.

Weaknesses

I have several concerns regarding the quality of the benchmark and the insufficient exploration of the bottlenecks of current video MLLMs.
* Diversity of Videos: Although the benchmark includes 103 extensively long videos from YouTube, concerns remain regarding the diversity of scene categories, filming techniques, event relationships, and the variety of objects and motions depicted. Adding a statistical chart to illustrate the diversity of the videos is necessary.
* Double checking. The dataset wo

Reviewer 02·Rating 3·Confidence 5

Strengths

+ The paper introduces a video understanding benchmark that emphasizes long videos, longer than prior works.
+ The paper does a commendable job of enumerating different capabilities required for long video understanding.
+ The paper evaluates a total of 15 different MLLMs, which is a reasonably comprehensive list.

Weaknesses

Benchmark:
- While I appreciate the effort to create a benchmark with longer videos, as emphasized in the paper and Table 1, without enough scale, the benchmark can prove to be of limited utility. As noted in Table 1, the benchmark has 1549 QA pairs, making it the second smallest benchmark in terms of number of QA pairs, with only the older ActivityNet-QA (Yu et al., 2019) having fewer. It only has 103 videos, which is considered quite small by current standards. In fact, number of videos shou

Reviewer 03·Rating 5·Confidence 4

Strengths

- It is a new benchmark with longer videos than other existing datasets.

Weaknesses

- The annotation process is not 100% clear. The paper does not reveal the detailed process used to collect the annotations. For instance, what instructions were given to the annotators to prepare the question and the four options in multiple choice questions? Example annotations provided in the paper are also extremely limited (just one example in Figure 2 per type), making it very difficult to judge the quality of the dataset. We also do not see any supplementary material or appendix with such

Reviewer 04·Rating 5·Confidence 4

Strengths

1. Timely and useful benchmark for long-video QnA containing diverse and long videos.
2. The paper is quite well written, clearly outlining the reasoning for building this benchmark and how it differs from existing works. Table 1 in particular is quite useful for the latter.
3. Several interesting analyses of dataset statistics.

Weaknesses

1. L143: Consider adding hours, i.e. … 4101 seconds, … -> … 4101 seconds ( > one hour), …
2. **On L316 (answer matching to choices):** The authors use the following setup for selecting the correct choice with a given VLM: “After obtaining the model responses, we first attempted to extract the answers using regular expression matching. For questions where the matching process was unsuccessful, we employed a GLM-4 model to extract the answers from the responses.”
   1. Could you try likelihoo
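
The two-stage answer extraction quoted in the review above (regular expressions first, a model as fallback) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `extract_choice`, the two patterns, and the `fallback` callable are all hypothetical. In the paper the fallback role is played by GLM-4; here it is any function the caller supplies.

```python
import re
from typing import Callable, Optional

# Hypothetical patterns (assumptions, not taken from the paper).
# 1) "answer is (B)", "Answer: C" ... letter must not be followed by a word character.
# 2) A response that starts with the letter, e.g. "C. The man leaves" or "(D)".
_PATTERNS = [
    re.compile(r"(?i:answer)\s*(?:is|:)?\s*\(?([A-D])\)?(?!\w)"),
    re.compile(r"^\s*\(?([A-D])(?:[.:)]|\s*$)"),
]


def extract_choice(
    response: str,
    fallback: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Return the chosen option letter (A-D), or None if nothing can be extracted."""
    # Stage 1: cheap regular-expression matching.
    for pattern in _PATTERNS:
        match = pattern.search(response)
        if match:
            return match.group(1)
    # Stage 2: defer to a caller-supplied extractor (e.g. a wrapper around an LLM).
    if fallback is not None:
        return fallback(response)
    return None


if __name__ == "__main__":
    print(extract_choice("The answer is (B)."))
    print(extract_choice("I think the second option fits best"))
```

Heuristics like these can misfire on free-form responses, which is presumably why a model-based fallback is used for the cases the patterns miss.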

Code & Models

Repositories

THUDM/LVBench
Official

Taxonomy

Topics: Real-time simulation and control systems

Methods: Sparse Evolutionary Training