Unleashing Hour-Scale Video Training for Long Video-Language Understanding

Jingyang Lin; Jialian Wu; Ximeng Sun; Ze Wang; Jiang Liu; Yusheng Su; Xiaodong Yu; Hao Chen; Jiebo Luo; Zicheng Liu; Emad Barsoum

arXiv:2506.05332 · cs.CV · December 3, 2025

TL;DR

This paper introduces VideoMarathon, a large-scale instruction-following dataset of long videos (3 to 60 minutes each) with 3.3M QA pairs, and Hour-LLaVA, a Video-LMM built on it for hour-scale video-language understanding.

Contribution

The paper presents a novel dataset for hour-long videos and a new model that enables efficient hour-scale video-language training and inference.

Findings

01. Hour-LLaVA achieves state-of-the-art results on long video benchmarks.

02. VideoMarathon extends training video durations up to 1 hour and supports 22 diverse tasks requiring both short- and long-term video comprehension.

03. The model leverages memory augmentation for long-term video comprehension.

Abstract

Recent long-form video-language understanding benchmarks have driven progress in video large multimodal models (Video-LMMs). However, the scarcity of well-annotated long videos has left the training of hour-long Video-LMMs underexplored. To close this gap, we present VideoMarathon, a large-scale hour-long video instruction-following dataset. This dataset includes around 9,700 hours of long videos sourced from diverse domains, ranging from 3 to 60 minutes per video. Specifically, it contains 3.3M high-quality QA pairs, spanning six fundamental topics: temporality, spatiality, object, action, scene, and event. Compared to existing video instruction datasets, VideoMarathon significantly extends training video durations up to 1 hour, and supports 22 diverse tasks requiring both short- and long-term video comprehension. Building on VideoMarathon, we propose Hour-LLaVA, a powerful and…

Code & Models

Datasets

jylins/videomarathon
dataset · 216 dl

Videos

Unleashing Hour-Scale Video Training for Long Video-Language Understanding · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition