OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding

Ruixiang Zhao; Jie Yang; Zijie Xin; Tianyi Wang; Fengyun Rao; Jing LYU; Xirong Li

arXiv:2605.18577 · cs.CV · May 19, 2026

TL;DR

OmniPro is a benchmark for evaluating omni-modal large language models on proactive streaming video understanding. It jointly tests omni-modal perception, proactive responding, and a diverse set of video understanding tasks.

Contribution

The paper introduces the first benchmark to jointly evaluate omni-modal perception and proactive response in streaming videos, with 2,700 human-verified samples, detailed annotations, and dual-mode evaluation protocols.

Findings

01. Audio signals improve model performance but are used variably.

02. Model performance declines over time, showing limited long-term robustness.

03. Perception of non-speech audio remains a weak point.

Abstract

Omni-proactive streaming video understanding, i.e., autonomously deciding when to speak and what to say from continuous audio-visual streams, is an emerging capability of omni-modal large language models. Existing benchmarks fall short in three key aspects: they rely primarily on visual signals, adopt polling or fixed-timestamp protocols instead of true proactive evaluation, and cover only a limited range of tasks, preventing reliable assessment and differentiation of omni-proactive streaming models. We present OmniPro, the first benchmark to jointly evaluate omni-modal perception, proactive responding, and diverse video understanding tasks. It comprises 2,700 human-verified samples spanning 9 sub-tasks and 3 cognitive levels, covering 6 basic video understanding capabilities. Notably, 84% of samples require audio signals (speech or non-speech), and each sample is annotated with…

Code & Models

Repositories

ruixiangzhao/OmniPro
github

Datasets

RuixiangZhao/OmniPro
dataset · 1.5k downloads
