StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding

Junming Lin; Zheng Fang; Chi Chen; Zihao Wan; Fuwen Luo; Peng Li; Yang Liu; Maosong Sun

arXiv:2411.03628·cs.CV·November 7, 2024

TL;DR

StreamingBench is a new benchmark that evaluates the real-time streaming video understanding capabilities of Multimodal Large Language Models, revealing significant gaps compared to human performance and guiding future improvements.

Contribution

This paper introduces StreamingBench, the first comprehensive benchmark for assessing streaming video understanding in MLLMs across multiple core aspects.

Findings

1. Most advanced MLLMs perform significantly below human level in streaming scenarios.

2. Current models struggle with real-time visual, omni-source, and contextual understanding.

3. The benchmark includes 18 tasks with 900 videos and 4,500 QA pairs.

Abstract

The rapid development of Multimodal Large Language Models (MLLMs) has expanded their capabilities from image comprehension to video understanding. However, most of these MLLMs focus primarily on offline video comprehension, necessitating extensive processing of all video frames before any queries can be made. This presents a significant gap compared to the human ability to watch, listen, think, and respond to streaming inputs in real time, highlighting the limitations of current MLLMs. In this paper, we introduce StreamingBench, the first comprehensive benchmark designed to evaluate the streaming video understanding capabilities of MLLMs. StreamingBench assesses three core aspects of streaming video understanding: (1) real-time visual understanding, (2) omni-source understanding, and (3) contextual understanding. The benchmark consists of 18 tasks, featuring 900 videos and 4,500…

Code & Models

Repositories

thunlp-mt/streamingbench
PyTorch · Official

Datasets

mjuicem/StreamingBench
dataset · 3.0k downloads

Taxonomy

Topics: Semantic Web and Ontologies

Methods: Focus