LvBench: A Benchmark for Long-form Video Understanding with Versatile Multi-modal Question Answering
Hongjie Zhang, Lu Dong, Yi Liu, Yifei Huang, Yali Wang, Limin Wang, Yu Qiao

TL;DR
LvBench is a comprehensive benchmark for long-form video understanding that features extended videos, diverse question types, and high-quality annotations to challenge and evaluate multi-modal VideoQA methods.
Contribution
The paper introduces LvBench, a new long-form VideoQA dataset with longer videos, diverse questions, and rigorous annotations, addressing limitations of previous datasets.
Findings
Existing methods perform worse on longer videos.
LvBench covers videos from 70 seconds to 4 hours.
Diverse question types reveal different perceptual and cognitive challenges.
Abstract
Despite remarkable recent progress, existing long-form VideoQA datasets fall short of meeting the criteria for genuine long-form video understanding. This is primarily due to the use of short videos for question curation, and the reliance on limited-length sub-clips as clues to answer those questions. Meanwhile, previous datasets have limited focus on question type and modality. To remedy this, we introduce LvBench, a Long-form video understanding benchmark for versatile multi-modal question-answering. Our LvBench stands out from existing long-form VideoQA datasets through three key characteristics: 1) Extended temporal durations: We consider videos ranging from 70 seconds to 4 hours, covering single-scene, multi-scene, and full-scene contexts. This design accounts for both video and clue lengths, capturing diverse contextual dynamics. 2) Diverse question types and modalities: LvBench…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
