RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video

Shuhang Xun; Sicheng Tao; Jungang Li; Yibo Shi; Zhixin Lin; Zhanhui Zhu; Yibo Yan; Hanqian Li; Linghao Zhang; Shikang Wang; Yixin Liu; Hanbo Zhang; Ying Ma; Xuming Hu

arXiv:2505.02064 · cs.CV · January 16, 2026

TL;DR

RTV-Bench is a comprehensive benchmark designed to evaluate multimodal large language models on continuous, real-time video streams, highlighting current limitations and guiding future improvements in dynamic video understanding.

Contribution

We introduce RTV-Bench, a detailed benchmark with multi-timestamp questions and hierarchical structures to assess MLLMs on real-time video analysis, covering perception, understanding, and reasoning.

Findings

01 Real-time models outperform offline models but still lag behind proprietary systems.

02 Scaling model size improves performance, but increasing input frame density does not always help.

03 Current architectures face limitations in handling long-horizon video streams.

Abstract

Multimodal Large Language Models (MLLMs) have made rapid progress in perception, understanding, and reasoning, yet existing benchmarks fall short in evaluating these abilities under continuous and dynamic real-world video streams. Such settings require models to maintain coherent understanding and reasoning as visual scenes evolve over time. **We introduce RTV-Bench, a fine-grained benchmark for real-time video analysis with MLLMs**. It is built upon three key principles: multi-timestamp question answering, hierarchical question structures spanning perception and reasoning, and multi-dimensional evaluation of continuous perception, understanding, and reasoning. RTV-Bench comprises 552 diverse videos and 4,608 carefully curated QA pairs covering a wide range of dynamic scenarios. We evaluate a broad range of state-of-the-art MLLMs, including proprietary, open-source offline, and…
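
To make the first two principles concrete, here is a minimal sketch of how a multi-timestamp, level-tagged QA item could be represented and scored. It is an illustration only, not the benchmark's actual schema or evaluation code: the class names `TimedQuery` and `MultiTimestampQA`, the field names, the `level` labels, and the `accuracy_by_level` helper are all hypothetical, and it assumes that the same question may have a different correct option at different moments in the stream.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class TimedQuery:
    """One query point: the question is asked at this moment in the stream."""
    timestamp_s: float  # seconds into the video (hypothetical unit)
    answer: str         # key of the correct option at that moment


@dataclass
class MultiTimestampQA:
    """Hypothetical record: one question, queried at several timestamps."""
    video_id: str
    level: str                      # assumed label, e.g. "perception" or "reasoning"
    question: str
    options: dict[str, str]
    queries: list[TimedQuery] = field(default_factory=list)


def accuracy_by_level(
    items: list[MultiTimestampQA],
    predict: Callable[[MultiTimestampQA, float], str],
) -> dict[str, float]:
    """Score every (item, timestamp) pair and report accuracy per level."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for item in items:
        for query in item.queries:
            total[item.level] += 1
            if predict(item, query.timestamp_s) == query.answer:
                correct[item.level] += 1
    return {level: correct[level] / total[level] for level in total}


# Toy item: the correct answer changes as the scene evolves.
item = MultiTimestampQA(
    video_id="demo",
    level="perception",
    question="How many people are in the room?",
    options={"A": "one", "B": "two"},
    queries=[TimedQuery(10.0, "A"), TimedQuery(45.0, "B")],
)


def always_a(item: MultiTimestampQA, timestamp_s: float) -> str:
    """Baseline that ignores the timestamp and always picks option A."""
    return "A"


print(accuracy_by_level([item], always_a))  # {'perception': 0.5}
```

Under this assumed layout, a model that ignores how the scene changes is right at only one of the two query points, which is the kind of failure a multi-timestamp design is meant to expose.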

Code & Models

Repositories

ljungang/rtv-bench
PyTorch · Official

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Semantic Web and Ontologies · Machine Learning and Data Classification