Online Video Understanding: OVBench and VideoChat-Online
Zhenpeng Huang, Xinhao Li, Jiaqi Li, Jing Wang, Xiangyu Zeng, Cheng Liang, Tao Wu, Xi Chen, Liang Li, Limin Wang

TL;DR
This paper introduces OVBench, a question-answering benchmark for online video understanding, together with a memory-augmented model architecture and a training strategy. The resulting model, VideoChat-Online, is reported to outperform prior models on real-time video tasks while running more efficiently.
Contribution
The paper contributes an evaluation benchmark (OVBench), a memory-augmented model architecture (the Pyramid Memory Bank), and a training paradigm for online video understanding.
Findings
VideoChat-Online outperforms state-of-the-art models on OVBench and offline benchmarks.
The Pyramid Memory Bank effectively retains key spatiotemporal information.
The proposed training strategy enhances online video understanding efficiency.
Abstract
Multimodal Large Language Models (MLLMs) have made significant progress in offline video understanding. However, applying these models to real-world scenarios, such as autonomous driving and human-computer interaction, presents unique challenges because continuous online video streams must be processed in real time. To this end, this paper presents systematic efforts from three perspectives: evaluation benchmark, model architecture, and training strategy. First, we introduce OVBench, a comprehensive question-answering benchmark designed to evaluate models' ability to perceive, memorize, and reason within online video contexts. It features 6 core task types across three temporal contexts (past, current, and future), forming 16 subtasks from diverse datasets. Second, we propose a new Pyramid Memory Bank (PMB) that effectively retains key spatiotemporal information in video streams.…
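The abstract as shown here names the Pyramid Memory Bank but is cut off before describing how it works. As a rough intuition only, the sketch below shows one generic way a multi-level streaming memory could keep recent frames at full resolution and older frames at coarser resolution. Everything in it is an assumption made for illustration: the class name ToyPyramidMemory, the per-level capacities, the oldest-first eviction, and the 2x average pooling are hypothetical and are not taken from the paper.

```python
from collections import deque


def downsample_2x(frame):
    """Average-pool a 2D grid (list of rows) by a factor of 2 per side."""
    h, w = len(frame) // 2, len(frame[0]) // 2
    return [
        [
            (frame[2 * i][2 * j] + frame[2 * i][2 * j + 1]
             + frame[2 * i + 1][2 * j] + frame[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(w)
        ]
        for i in range(h)
    ]


class ToyPyramidMemory:
    """Hypothetical multi-level streaming buffer; NOT the paper's PMB.

    Level 0 holds the newest frames at full resolution. When a level
    exceeds its capacity, its oldest frame is coarsened and pushed to
    the next level; frames leaving the last level are discarded.
    Frame sides are assumed divisible by 2 ** (number of levels - 1).
    """

    def __init__(self, capacities):
        self.capacities = list(capacities)
        self.levels = [deque() for _ in self.capacities]

    def add(self, frame):
        carry = frame
        for level, cap in zip(self.levels, self.capacities):
            level.append(carry)
            if len(level) <= cap:
                return  # no overflow, nothing to propagate
            # Overflow: evict the oldest frame and coarsen it.
            carry = downsample_2x(level.popleft())
        # Reaching here means `carry` fell off the last level and is dropped.

    def shapes(self):
        """Return (height, width) of every stored frame, per level."""
        return [[(len(f), len(f[0])) for f in level] for level in self.levels]


if __name__ == "__main__":
    memory = ToyPyramidMemory(capacities=[2, 2, 2])
    for t in range(7):
        # Each toy "frame" is an 8x8 grid filled with its timestamp.
        memory.add([[float(t)] * 8 for _ in range(8)])
    print(memory.shapes())
```

With a fixed number of levels and fixed capacities, the memory footprint stays bounded no matter how long the stream runs, which is the property an online setting needs. The actual PMB may select, merge, or compress frames quite differently.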
Taxonomy
Topics: Video Analysis and Summarization
