VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs
Naishan Zheng, Jie Huang, Qingpei Guo, Feng Zhao

TL;DR
VideoScaffold introduces a dynamic, hierarchical framework for streaming video understanding that adaptively refines event boundaries and aggregates semantics, enabling continuous, coherent comprehension with state-of-the-art results.
Contribution
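VideoScaffold proposes an elastic-scale, hierarchical approach to real-time video understanding built on two components, Elastic-Scale Event Segmentation (EES) and Hierarchical Event Consolidation (HEC), to address the limitations of static methods such as sparse sampling, frame compression, and clustering.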
Findings
Achieves state-of-the-art performance on streaming video benchmarks.
Seamlessly extends image-based MLLMs to continuous video comprehension.
Demonstrates effectiveness in both offline and streaming settings.
Abstract
Understanding long videos with multimodal large language models (MLLMs) remains challenging due to the heavy redundancy across frames and the need for temporally coherent representations. Existing static strategies, such as sparse sampling, frame compression, and clustering, are optimized for offline settings and often produce fragmented or over-compressed outputs when applied to continuous video streams. We present VideoScaffold, a dynamic representation framework designed for streaming video understanding. It adaptively adjusts event granularity according to video duration while preserving fine-grained visual semantics. VideoScaffold introduces two key components: Elastic-Scale Event Segmentation (EES), which performs prediction-guided segmentation to dynamically refine event boundaries, and Hierarchical Event Consolidation (HEC), which progressively aggregates semantically related…
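The two stages named in the abstract can be illustrated with a toy sketch. This is not the authors' EES or HEC implementation: the running-mean "prediction", the cosine-similarity threshold, the `max_events` budget, and all function names are assumptions made purely for illustration, and frames are stood in for by small feature vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def segment_events(frames, threshold=0.9):
    """Hypothetical stand-in for prediction-guided segmentation.

    The mean of the current event acts as the predicted next frame;
    a frame that diverges from it (similarity below `threshold`)
    opens a new event. Frames are processed one at a time, as in a stream.
    """
    events = []
    for frame in frames:
        if events and cosine(mean(events[-1]), frame) >= threshold:
            events[-1].append(frame)
        else:
            events.append([frame])
    return events


def consolidate(events, max_events):
    """Hypothetical stand-in for hierarchical consolidation.

    Repeatedly merges the most similar pair of adjacent events until
    at most `max_events` remain, so granularity coarsens as the
    stream grows while temporal order is preserved.
    """
    events = [list(e) for e in events]
    while len(events) > max(1, max_events):
        sims = [
            cosine(mean(events[i]), mean(events[i + 1]))
            for i in range(len(events) - 1)
        ]
        i = max(range(len(sims)), key=sims.__getitem__)
        events[i:i + 2] = [events[i] + events[i + 1]]
    return events
```

For example, calling `segment_events` on a list of 2-D feature vectors and then `consolidate(events, 2)` yields two coarse events that together still contain every input frame. A real system would operate on visual-encoder tokens and a learned predictor rather than a running mean.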
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis
