OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding
Ruixiang Zhao, Jie Yang, Zijie Xin, Tianyi Wang, Fengyun Rao, Jing LYU, Xirong Li

TL;DR
OmniPro is a benchmark for evaluating omni-modal large language models on proactive streaming video understanding, where a model must decide on its own when to speak and what to say from a continuous audio-visual stream.
Contribution
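It introduces the first benchmark to jointly evaluate omni-modal perception, proactive responding, and diverse video understanding tasks in streaming video. The benchmark comprises 2,700 human-verified samples across 9 sub-tasks and 3 cognitive levels, with detailed annotations and dual-mode evaluation protocols.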
Findings
Audio signals improve model performance but are used variably.
Model performance declines over time, showing limited long-term robustness.
Perception of non-speech audio remains a weak point.
Abstract
Omni-proactive streaming video understanding, i.e., autonomously deciding when to speak and what to say from continuous audio-visual streams, is an emerging capability of omni-modal large language models. Existing benchmarks fall short in three key aspects: they rely primarily on visual signals, adopt polling or fixed-timestamp protocols instead of true proactive evaluation, and cover only a limited range of tasks, preventing reliable assessment and differentiation of omni-proactive streaming models. We present OmniPro, the first benchmark to jointly evaluate omni-modal perception, proactive responding, and diverse video understanding tasks. It comprises 2,700 human-verified samples spanning 9 sub-tasks and 3 cognitive levels, covering 6 basic video understanding capabilities. Notably, 84% of samples require audio signals (speech or non-speech), and each sample is annotated with…
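The distinction the abstract draws between fixed-timestamp (polling) evaluation and proactive evaluation can be illustrated with a toy sketch. This is not OmniPro's protocol, which this excerpt does not describe. The `Sample` fields, the function names, and the idea of a "valid response window" are all hypothetical, introduced only to show why the two protocols measure different things.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical annotation: the interval (in seconds) in which a
    # response counts as timely. Not taken from the OmniPro annotations.
    window_start: float
    window_end: float


def polled_hit(answers_at_poll: dict[float, bool], poll_time: float) -> bool:
    """Fixed-timestamp protocol: the evaluator asks at poll_time and only
    checks whether the answer given at that moment is correct."""
    return answers_at_poll.get(poll_time, False)


def proactive_hit(response_times: list[float], sample: Sample) -> bool:
    """Toy proactive protocol: the model chooses when to speak and is
    credited only if some response lands inside the annotated window."""
    return any(
        sample.window_start <= t <= sample.window_end for t in response_times
    )


def false_alarms(response_times: list[float], sample: Sample) -> int:
    """Count responses emitted outside the window (speaking at the wrong time)."""
    return sum(
        1
        for t in response_times
        if not (sample.window_start <= t <= sample.window_end)
    )


sample = Sample(window_start=10.0, window_end=15.0)
model_spoke_at = [3.0, 12.5]  # hypothetical model output timestamps

print(proactive_hit(model_spoke_at, sample))
print(false_alarms(model_spoke_at, sample))
```

Under polling, the evaluator supplies the timestamp, so a model is never penalised for speaking too early, too late, or too often. In the toy proactive setting, timing is part of what is scored.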
