OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?

Yifei Li; Junbo Niu; Ziyang Miao; Chunjiang Ge; Yuanhang Zhou; Qihao He; Xiaoyi Dong; Haodong Duan; Shuangrui Ding; Rui Qian; Pan Zhang; Yuhang Zang; Yuhang Cao; Conghui He; Jiaqi Wang

arXiv:2501.05510 · cs.CV · March 28, 2025

TL;DR

OVO-Bench is a new benchmark that evaluates how well online video LLMs understand and reason about video streams in real time, with a focus on temporal awareness: answers must depend on the timestamp at which the question is asked.

Contribution

The paper introduces OVO-Bench, a comprehensive benchmark with new tasks and annotations designed specifically to assess online video understanding in video LLMs, a capability that existing evaluations do not adequately cover.

Findings

01. Current models underperform on online video understanding tasks.

02. A significant gap remains between model performance and human-level reasoning.

03. The benchmark reveals that models struggle with temporal reasoning and real-time comprehension.

Abstract

Temporal Awareness, the ability to reason dynamically based on the timestamp when a question is raised, is the key distinction between offline and online video LLMs. Unlike offline models, which rely on complete videos for static, post hoc analysis, online models process video streams incrementally and dynamically adapt their responses based on the timestamp at which the question is posed. Despite its significance, temporal awareness has not been adequately evaluated in existing benchmarks. To fill this gap, we present OVO-Bench (Online-VideO-Benchmark), a novel video benchmark that emphasizes the importance of timestamps for advanced online video understanding capability benchmarking. OVO-Bench evaluates the ability of video LLMs to reason and respond to events occurring at specific timestamps under three distinct scenarios: (1) Backward tracing: trace back to past events to answer the…
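The offline/online distinction drawn in the abstract can be illustrated with a minimal sketch. This is not OVO-Bench's evaluation code: the names `Frame`, `Query`, `online_context`, and `offline_context` are hypothetical, and frames are represented by placeholder strings. The sketch only shows the assumption that an online model sees frames up to the moment the question is posed, while an offline model sees the complete video.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    timestamp: float  # seconds since the start of the stream
    data: str         # placeholder standing in for pixel data


@dataclass
class Query:
    question: str
    asked_at: float   # timestamp at which the question is posed


def online_context(stream: List[Frame], query: Query) -> List[Frame]:
    """Online setting (assumed): only frames at or before the query time are visible."""
    return [f for f in stream if f.timestamp <= query.asked_at]


def offline_context(stream: List[Frame], query: Query) -> List[Frame]:
    """Offline setting (assumed): the whole video is visible regardless of query time."""
    return list(stream)


# A toy 10-second stream sampled at one frame per second.
stream = [Frame(float(t), f"frame-{t}") for t in range(10)]
query = Query("What did the person just pick up?", asked_at=4.0)

online = online_context(stream, query)
offline = offline_context(stream, query)
```

Under these assumptions, the same question asked at a different timestamp yields a different visible context in the online setting, which is why the expected answer can change with the query time.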

Code & Models

Repositories

joeleelyf/ovo-bench
PyTorch · Official

Datasets

JoeLeelyf/OVO-Bench
dataset · 3.1k downloads

Taxonomy

Topics: Video Analysis and Summarization

Methods: High-Order Consensuses