VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs

Naishan Zheng; Jie Huang; Qingpei Guo; Feng Zhao

arXiv:2512.22226 · cs.CV · December 30, 2025

TL;DR

VideoScaffold introduces a dynamic, hierarchical framework for streaming video understanding that adaptively refines event boundaries and aggregates semantics, enabling continuous, coherent comprehension with state-of-the-art results.

Contribution

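The paper proposes an elastic-scale, hierarchical approach for real-time video understanding built on two components, Elastic-Scale Event Segmentation (EES) and Hierarchical Event Consolidation (HEC). It targets the fragmented or over-compressed outputs that static methods produce on continuous video streams.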

Findings

01. Achieves state-of-the-art performance on streaming video benchmarks.

02. Seamlessly extends image-based MLLMs to continuous video comprehension.

03. Demonstrates effectiveness in both offline and streaming settings.

Abstract

Understanding long videos with multimodal large language models (MLLMs) remains challenging due to the heavy redundancy across frames and the need for temporally coherent representations. Existing static strategies, such as sparse sampling, frame compression, and clustering, are optimized for offline settings and often produce fragmented or over-compressed outputs when applied to continuous video streams. We present VideoScaffold, a dynamic representation framework designed for streaming video understanding. It adaptively adjusts event granularity according to video duration while preserving fine-grained visual semantics. VideoScaffold introduces two key components: Elastic-Scale Event Segmentation (EES), which performs prediction-guided segmentation to dynamically refine event boundaries, and Hierarchical Event Consolidation (HEC), which progressively aggregates semantically related…
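
The two components named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the predictor, the similarity measure, or any thresholds, so everything below is an assumption made for illustration. Here the "prediction" of the next frame is simply the mean feature of the current event, a boundary is opened when the incoming frame's cosine similarity to that prediction falls below a hypothetical `boundary_threshold`, and consolidation merges adjacent events whose mean features exceed a hypothetical `merge_threshold`. The function names `segment_stream` and `consolidate` are likewise invented.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean(vectors):
    """Element-wise mean of a non-empty list of feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def segment_stream(frames, boundary_threshold=0.8):
    """Assumed stand-in for prediction-guided segmentation.

    The running mean of the current event acts as the predicted next
    frame; a poorly predicted frame starts a new event.
    """
    events, current = [], []
    for frame in frames:
        if current and cosine(mean(current), frame) < boundary_threshold:
            events.append(current)
            current = []
        current.append(frame)
    if current:
        events.append(current)
    return events


def consolidate(events, merge_threshold=0.9):
    """Assumed stand-in for one level of event consolidation.

    Adjacent events with similar mean features are merged into a
    coarser event; dissimilar neighbours are kept separate.
    """
    merged = []
    for event in events:
        if merged and cosine(mean(merged[-1]), mean(event)) >= merge_threshold:
            merged[-1] = merged[-1] + event
        else:
            merged.append(list(event))
    return merged


# Hypothetical 2-D frame features: two frames of one scene, two of
# another, then a return to something like the first scene.
frames = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99], [1.0, 0.05]]
fine_events = segment_stream(frames)
coarse_events = consolidate(fine_events, merge_threshold=0.0)
```

In this toy setup, lowering `merge_threshold` as the stream grows would yield coarser events for longer videos, which is one plausible reading of adjusting event granularity to video duration; the paper's actual mechanism may differ.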

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis