ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding
David Ma, Huaqing Yuan, Xingjian Wang, Qianbo Zang, Tianci Liu, Xinyang He, Yanbin Wei, Jiawei Guo, Ni Jiahui, Zhenzhu Yang, Meng Cao, Shanghaoran Quan, Yizhi Li, Wangchunshu Zhou, Jiaheng Liu, Wenhao Huang, Ge Zhang, Shiwen Ni, and Xiaojie Jin

TL;DR
ScaleLong introduces a comprehensive benchmark with multi-timescale questions within the same long videos, enabling direct performance comparison of models across different temporal levels for improved long-video understanding.
Contribution
It is the first benchmark to embed hierarchical timescale questions within the same videos, facilitating multi-scale evaluation of models on long-video understanding.
Findings
Models show a U-shaped performance curve across timescales.
Increasing visual token capacity improves reasoning at all timescales.
ScaleLong enables detailed analysis of model performance across hierarchical temporal levels.
Abstract
Although long-video understanding demands that models capture hierarchical temporal information -- from clip (seconds) and shot (tens of seconds) to event (minutes) and story (hours) -- existing benchmarks either neglect this multi-scale design or scatter scale-specific questions across different videos, preventing direct comparison of model performance across timescales on the same content. To address this, we introduce ScaleLong, the first benchmark to disentangle these factors by embedding questions targeting four hierarchical timescales -- clip (seconds), shot (tens of seconds), event (minutes), and story (hours) -- all within the same video content. This within-content multi-timescale questioning design enables direct comparison of model performance across timescales on identical videos. ScaleLong features 269 long videos (avg. 86 min) from 5 main categories and 36…
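A minimal sketch of the within-content comparison the abstract describes: accuracy is computed per timescale, but only over videos that carry questions at all four timescales, so every timescale is scored on identical content. The record layout (`video_id`, `timescale`, `correct`) and the function name are hypothetical, chosen for illustration; they are not taken from the ScaleLong release.

```python
from collections import defaultdict

# The four timescales named in the abstract.
TIMESCALES = ("clip", "shot", "event", "story")


def accuracy_by_timescale(records):
    """Return {timescale: mean per-video accuracy}.

    Each record is a hypothetical dict with keys
    'video_id', 'timescale', and 'correct' (bool).
    """
    # per_video[video_id][timescale] -> list of 0/1 outcomes
    per_video = defaultdict(lambda: defaultdict(list))
    for r in records:
        per_video[r["video_id"]][r["timescale"]].append(int(r["correct"]))

    # Keep only videos with questions at all four timescales, so each
    # timescale is evaluated on exactly the same set of videos.
    complete = [v for v in per_video.values() if all(t in v for t in TIMESCALES)]

    result = {}
    for t in TIMESCALES:
        per_video_acc = [sum(v[t]) / len(v[t]) for v in complete]
        result[t] = (
            sum(per_video_acc) / len(per_video_acc) if per_video_acc else float("nan")
        )
    return result
```

Averaging per-video accuracies, rather than pooling all questions, is one possible design choice: it keeps a video with many questions from dominating a timescale's score.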
Peer Reviews
Decision: ICLR 2026 Poster
Novel Temporal Framework: The paper’s hierarchical division of video understanding into clip-, shot-, event-, and story-level timescales is a novel and insightful framework that enables fine-grained analysis of temporal reasoning abilities within a single benchmark. Interesting ablation studies: The ablation experiments are thoughtfully designed to show interesting insights about the effects of visual token allocation and temporal coverage. Rigorous Dataset Construction: The dataset creation proce
Limited Illustrative Examples: The paper provides only a small number of examples from the benchmark, which makes it difficult to fully appreciate the nuances of question design across different time scales. Inconsistent Story-Level Definition: Although the paper claims that story-level questions require holistic narrative understanding, the provided example of a story-level question and its answers focuses on a sequential event listing rather than true integration of information from the entire vide
1) Comprehensive benchmark to embed questions at four hierarchical temporal scales within identical video content, enabling within-content comparison that isolates timescale effects from content variability 2) Section 4.3.2 and Figure 3(c) analyze different frame-resolution combinations described as under "fixed visual-token budget," revealing timescale-dependent optimal allocations (Clip benefits from many low-res frames; Story from balanced configurations) 3) Ten predefined distractor types en
1) Humans are evaluated on "Whole Video" (continuous playback, ~150,000 frames) while models receive sparse samples (32-256 frames). The 23.1-point human-model gap may conflate access differences with capability differences, making it hard to interpret. 2) Given 5 task families × 4 scales × many categories, some slices will be small. The authors make strong qualitative claims (e.g., U-shape) but the paper has no per-slice CIs/bootstraps and no inter-annotator agreement figures for annotation
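The per-slice confidence intervals this review asks for could be estimated with a percentile bootstrap. The sketch below is a generic illustration, not the authors' procedure; the function name, the number of resamples, and the fixed seed are assumptions made for the example.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the accuracy of one slice.

    `outcomes` is a list of 0/1 correctness values for the questions in
    a single (task, timescale) slice. Returns (accuracy, lower, upper).
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    rng = random.Random(seed)  # fixed seed (assumption) for reproducibility
    n = len(outcomes)

    # Resample the slice with replacement and record each resample's accuracy.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )

    # Read the interval off the empirical quantiles of the resampled means.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(outcomes) / n, lower, upper
```

For a slice with only a few dozen questions, the resulting interval is typically wide, which is the concern behind the reviewer's caution about small slices.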
1. The motivation for this study is sound, and the comprehensive measurement of the model's ability to perceive and understand temporal information at different scales in long videos is meaningful. 2. The paper is clearly organized and easy to understand.
1. Long videos collected from YouTube, particularly TV and sports videos, often feature famous events. The training corpus for MLLMs may contain information about these events. How can the authors ensure that the questions they ask are not answered by the inherent knowledge contained in the model? 2. Where do the four time scales of Clip (less than 3 seconds), Shot (4-15 seconds), Event (16 seconds-10 minutes), and Story (more than 10 minutes) and their respective time range division standards come fro
Taxonomy
Topics: Video Analysis and Summarization · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis
Methods: Contrastive Language-Image Pre-training
