LVBench: An Extreme Long Video Understanding Benchmark
Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Xiaotao Gu, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, Jie Tang

TL;DR
LVBench is a new benchmark designed to evaluate and advance multimodal models' ability to understand and extract information from long videos spanning several hours, addressing a gap left by current datasets that focus on short videos.
Contribution
The paper introduces LVBench, a comprehensive long video understanding benchmark with diverse tasks, to evaluate and promote models capable of long-term memory and extended comprehension.
Findings
Current models underperform on long video tasks
LVBench reveals the need for models with better long-term memory
Dataset and code are publicly available for research use
Abstract
Recent progress in multimodal large language models has markedly enhanced the understanding of short videos (typically under one minute), and several evaluation datasets have emerged accordingly. However, these advancements fall short of meeting the demands of real-world applications such as embodied intelligence for long-term decision-making, in-depth movie reviews and discussions, and live sports commentary, all of which require comprehension of long videos spanning several hours. To address this gap, we introduce LVBench, a benchmark specifically designed for long video understanding. Our dataset comprises publicly sourced videos and encompasses a diverse set of tasks aimed at long video comprehension and information extraction. LVBench is designed to challenge multimodal models to demonstrate long-term memory and extended comprehension capabilities. Our extensive evaluations reveal…
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
* Several baseline methods are adopted and tested on the new benchmark. * The quality control process is well designed to avoid the imbalance of question types and language bias.
I have several concerns regarding the quality of the benchmark and the insufficient exploration of the bottleneck of current video MLLMs. * Diversity of Videos: Although the benchmark includes 103 extensively long videos from YouTube, concerns remain regarding the diversity of scene categories, filming techniques, event relationships, and the variety of objects and motions depicted. Adding a statistical chart to illustrate the diversity of the videos is necessary. * Double checking. The dataset wo
+ The paper introduces a video understanding benchmark that emphasizes long videos, longer than prior works. + The paper does a commendable job of enumerating different capabilities required for long video understanding. + The paper evaluates a total of 15 different MLLMs, which is a reasonably comprehensive list.
Benchmark: - While I appreciate the effort to create a benchmark with longer videos, as emphasized in the paper and Table 1, without enough scale, the benchmark can prove to be of limited utility. As noted in Table 1, the benchmark has 1549 QA pairs, making it the second smallest benchmark in terms of number of QA, with only the older ActivityNet-QA (Yu et al., 2019) having fewer QA pairs. It only has 103 videos, which is considered quite small by current standards. In fact, number of videos shou
- It is a new benchmark with longer videos than other existing datasets.
- The annotation process is not 100% clear. The paper does not reveal the detailed process used to collect the annotations. For instance, what instructions were given to the annotators to prepare the question and the four options in multiple choice questions? Example annotations provided in the paper are also extremely limited (just one example in Figure 2 per type), making it very difficult to judge the quality of the dataset. We also do not see any supplementary material or appendix with such
1. Timely and useful benchmark for long-video QnA containing diverse and long videos. 2. The paper is quite well written, clearly outlining the reasoning for building this benchmark and how it differs from existing works. Table 1 in particular is quite useful for the latter. 3. Several interesting analyses of dataset statistics.
1. L143: Consider adding hours, i.e. … 4101 seconds, … -> … 4101 seconds ( > one hour), … 2. **On L316 (answer matching to choices):** The authors use the following setup for selecting the correct choice with a given VLM: “After obtaining the model responses, we first attempted to extract the answers using regular expression matching. For questions where the matching process was unsuccessful, we employed a GLM-4 model to extract the answers from the responses.” 1. Could you try likelihoo
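The answer-matching setup quoted in the review above (regular-expression extraction first, a model-based extractor only when matching fails) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the patterns, the function name `extract_choice`, and the `fallback` callable (a stand-in for the GLM-4 extraction step) are all assumptions made for the example.

```python
import re
from typing import Callable, Optional

# Hypothetical patterns for four-option (A-D) multiple-choice responses.
# The paper's actual regular expressions are not given in the source.
_PATTERNS = [
    # "The answer is (B)", "Answer: c", "option D"
    re.compile(
        r"(?:answer|option)\s*(?:is|:)?\s*\(?([A-D])\)?(?![A-Za-z])",
        re.IGNORECASE,
    ),
    # A response that starts with the bare letter: "B. The man ...", "(A)"
    re.compile(r"^\s*\(?([A-D])\)?\s*(?:[.:)]|$)"),
]


def extract_choice(
    response: str,
    fallback: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Return the chosen option letter, or None if nothing can be extracted.

    Regex matching is tried first. `fallback` is a placeholder for a
    model-based extractor and is only called when every pattern fails.
    """
    for pattern in _PATTERNS:
        match = pattern.search(response)
        if match:
            return match.group(1).upper()
    if fallback is not None:
        return fallback(response)
    return None
```

A response such as "The answer is definitely C" matches neither pattern here, so it would be routed to the fallback extractor. How often that path is taken depends on the patterns chosen, which is one reason extraction rules can affect reported accuracy.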
Taxonomy
Topics: Real-time simulation and control systems
Methods: Sparse Evolutionary Training
