StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding
Junming Lin, Zheng Fang, Chi Chen, Zihao Wan, Fuwen Luo, Peng Li, Yang Liu, Maosong Sun

TL;DR
StreamingBench is a new benchmark that evaluates the real-time streaming video understanding capabilities of Multimodal Large Language Models, revealing significant gaps compared to human performance and guiding future improvements.
Contribution
This paper introduces StreamingBench, the first comprehensive benchmark for assessing streaming video understanding in MLLMs across multiple core aspects.
Findings
Even the most advanced MLLMs perform significantly below human level in streaming scenarios.
Current models struggle with real-time visual, omni-source, and contextual understanding.
The benchmark includes 18 tasks with 900 videos and 4,500 QA pairs.
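The difference between offline and streaming comprehension can be illustrated with a minimal sketch. It is not taken from the benchmark's code: the `Frame` type, both function names, and the `query_time` parameter are hypothetical, and it assumes that a streaming model answering at a given moment may only use frames that arrived up to that moment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """Hypothetical frame record; not a StreamingBench data structure."""
    timestamp: float  # seconds from the start of the video
    index: int


def offline_context(frames):
    """Offline setting: every frame is available before any query is made."""
    return list(frames)


def streaming_context(frames, query_time):
    """Streaming setting (assumed): only frames seen up to the query time."""
    return [f for f in frames if f.timestamp <= query_time]


# Ten frames sampled at one frame per second (illustrative data only).
frames = [Frame(timestamp=float(t), index=t) for t in range(10)]
visible = streaming_context(frames, query_time=3.5)
```

Under this assumption, a question posed 3.5 seconds into the video is answered from the first four frames only, whereas the offline setting exposes all ten.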
Abstract
The rapid development of Multimodal Large Language Models (MLLMs) has expanded their capabilities from image comprehension to video understanding. However, most of these MLLMs focus primarily on offline video comprehension, necessitating extensive processing of all video frames before any queries can be made. This presents a significant gap compared to the human ability to watch, listen, think, and respond to streaming inputs in real time, highlighting the limitations of current MLLMs. In this paper, we introduce StreamingBench, the first comprehensive benchmark designed to evaluate the streaming video understanding capabilities of MLLMs. StreamingBench assesses three core aspects of streaming video understanding: (1) real-time visual understanding, (2) omni-source understanding, and (3) contextual understanding. The benchmark consists of 18 tasks, featuring 900 videos and 4,500…
Taxonomy
Topics: Semantic Web and Ontologies
Methods: Focus
