Online Video Understanding: OVBench and VideoChat-Online

Zhenpeng Huang; Xinhao Li; Jiaqi Li; Jing Wang; Xiangyu Zeng; Cheng; Liang; Tao Wu; Xi Chen; Liang Li; Limin Wang

arXiv:2501.00584·cs.CV·April 18, 2025

TL;DR

This paper introduces OVBench, a question-answering benchmark for online video understanding, together with a memory-augmented model architecture (the Pyramid Memory Bank) and a training strategy for streaming video. The resulting model, VideoChat-Online, is reported to outperform state-of-the-art models on OVBench and on offline benchmarks.

Contribution

The paper contributes an evaluation benchmark (OVBench), a memory-augmented model architecture (the Pyramid Memory Bank), and a training strategy for online video understanding.

Findings

01 VideoChat-Online outperforms state-of-the-art models on OVBench and offline benchmarks.

02 The Pyramid Memory Bank effectively retains key spatiotemporal information.

03 The proposed training strategy enhances online video understanding efficiency.

Abstract

Multimodal Large Language Models (MLLMs) have significantly progressed in offline video understanding. However, applying these models to real-world scenarios, such as autonomous driving and human-computer interaction, presents unique challenges due to the need for real-time processing of continuous online video streams. To this end, this paper presents systematic efforts from three perspectives: evaluation benchmark, model architecture, and training strategy. First, we introduce OVBench, a comprehensive question-answering benchmark designed to evaluate models' ability to perceive, memorize, and reason within online video contexts. It features 6 core task types across three temporal contexts (past, current, and future), forming 16 subtasks from diverse datasets. Second, we propose a new Pyramid Memory Bank (PMB) that effectively retains key spatiotemporal information in video streams.…

Code & Models

Repositories

MCG-NJU/VideoChat-Online
pytorch

Models

MCG-NJU/VideoChatOnline-4B

Taxonomy

Topics: Video Analysis and Summarization