HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding

Heqing Zou; Tianze Luo; Guiyang Xie; Victor Xiao Jie Zhang; Fengmao Lv; Guangcong Wang; Junyang Chen; Zhuochen Wang; Hansheng Zhang; Huaijian Zhang

arXiv:2501.01645·cs.CV·May 14, 2025


TL;DR

HLV-1K is a comprehensive large-scale benchmark dataset designed to evaluate models' ability to understand hour-long videos through diverse question-answering tasks, addressing a significant gap in long video understanding research.

Contribution

The paper introduces HLV-1K, the first large-scale, hour-long video dataset with extensive annotations for evaluating long video understanding models.

Findings

01. Existing models are tested on HLV-1K, revealing challenges in long-term video comprehension.

02. HLV-1K enables detailed evaluation of models across multiple reasoning levels.

03. The benchmark promotes development of more effective long video understanding methods.

Abstract

Multimodal large language models have become a popular topic in deep visual understanding due to many promising real-world applications. However, hour-long video understanding, spanning over one hour and containing tens of thousands of visual frames, remains under-explored because of 1) challenging long-term video analyses, 2) inefficient large-model approaches, and 3) lack of large-scale benchmark datasets. In this paper, we focus on the third of these by building a large-scale hour-long video benchmark, HLV-1K, designed to evaluate long video understanding models. HLV-1K comprises 1,009 hour-long videos with 14,847 high-quality question answering (QA) and multi-choice question answering (MCQA) pairs with time-aware queries and diverse annotations, covering frame-level, within-event-level, cross-event-level, and long-term reasoning tasks. We evaluate our benchmark using existing…
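
Since the abstract groups questions into four reasoning levels, one natural way to report MCQA results is accuracy per level. The following is a minimal sketch under stated assumptions, not the authors' evaluation code: the record fields `level`, `answer`, and `prediction`, the level names, and the function `accuracy_by_level` are all hypothetical and do not reflect the actual HLV-1K annotation schema.

```python
from collections import defaultdict

# Hypothetical level labels mirroring the four task groups named in the abstract.
LEVELS = ("frame", "within_event", "cross_event", "long_term")


def accuracy_by_level(records):
    """Return MCQA accuracy per reasoning level.

    Each record is a dict with the (assumed) keys:
      level      - one of LEVELS
      answer     - ground-truth option letter, e.g. "A"
      prediction - the model's chosen option letter
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        level = rec["level"]
        total[level] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
            correct[level] += 1
    # Only levels that actually appear in the records are reported.
    return {level: correct[level] / total[level] for level in total}


if __name__ == "__main__":
    # Made-up toy records for illustration only; not taken from the dataset.
    sample = [
        {"level": "frame", "answer": "A", "prediction": "A"},
        {"level": "frame", "answer": "C", "prediction": "B"},
        {"level": "long_term", "answer": "D", "prediction": " d "},
    ]
    for level, acc in accuracy_by_level(sample).items():
        print(f"{level}: {acc:.2f}")
```

Reporting each level separately, rather than a single pooled score, keeps weaknesses on long-term questions from being masked by stronger frame-level performance.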


Code & Models

Repositories

vincent-zhq/hlv-1k
Official

Datasets

ZouHQ/HLV-1K
dataset · 92 downloads


Taxonomy

Topics: Video Analysis and Summarization · Video Surveillance and Tracking Methods · Human Pose and Action Recognition

Methods: Focus