VideoChat: Chat-Centric Video Understanding

KunChang Li; Yinan He; Yi Wang; Yizhuo Li; Wenhai Wang; Ping Luo; Yali Wang; Limin Wang; Yu Qiao

arXiv:2305.06355 · cs.CV · January 5, 2024 · 90 cites

TL;DR

VideoChat is an end-to-end system that connects video foundation models to large language models through a learnable neural interface, targeting spatiotemporal reasoning, event localization, and causal relationship inference in videos.

Contribution

The paper introduces a chat-centric video understanding system and a video-centric instruction dataset of thousands of videos paired with detailed descriptions and conversations, which is used to instruction-tune the system.

Findings

01 Preliminary qualitative experiments show promising video understanding capabilities.
02 The dataset emphasizes spatiotemporal reasoning and causal relationships.
03 The system shows potential across a broad range of video applications.

Abstract

In this paper, we make an initial attempt at developing an end-to-end chat-centric video understanding system, named VideoChat. It integrates video foundation models and large language models via a learnable neural interface, excelling in spatiotemporal reasoning, event localization, and causal relationship inference. To instruction-tune this system, we build a video-centric instruction dataset composed of thousands of videos associated with detailed descriptions and conversations. This dataset emphasizes spatiotemporal reasoning and captures causal relationships, providing a valuable asset for training our chat-centric video understanding system. Preliminary qualitative experiments demonstrate the potential of our system across a broad spectrum of video applications, and it could serve as a simple prototype for future research on chat-centric video understanding. Access our…
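The abstract says only that video features reach the language model through a "learnable neural interface". As a rough illustration of what such an interface can look like, the sketch below pools frame features with a few learnable query vectors (cross-attention), projects the result into the language model's embedding width, and prepends it to the text token embeddings. Everything here is an assumption for illustration: the class name `ToyInterface`, the query-based design, the dimensions, and the random weights are hypothetical and are not taken from the paper or its repository.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights or features."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class ToyInterface:
    """Hypothetical video-to-LLM interface (an assumption, not the paper's code)."""

    def __init__(self, num_queries, video_dim, llm_dim, seed=0):
        rng = random.Random(seed)
        # Learnable query tokens: each one attends over all frame features.
        self.queries = random_matrix(num_queries, video_dim, rng)
        # Learnable projection from video feature width to LLM embedding width.
        self.proj = random_matrix(llm_dim, video_dim, rng)

    def __call__(self, video_feats):
        scale = math.sqrt(len(video_feats[0]))
        tokens = []
        for q in self.queries:
            # Scaled dot-product attention of one query over the frames.
            scores = [sum(a * b for a, b in zip(q, f)) / scale for f in video_feats]
            weights = softmax(scores)
            pooled = [
                sum(w * f[d] for w, f in zip(weights, video_feats))
                for d in range(len(q))
            ]
            # Project the pooled feature into the LLM embedding space.
            tokens.append([sum(w * x for w, x in zip(row, pooled)) for row in self.proj])
        return tokens


rng = random.Random(1)
video_feats = random_matrix(8, 16, rng)   # 8 frame features of width 16 (made up)
text_tokens = random_matrix(5, 32, rng)   # 5 text token embeddings of width 32 (made up)

interface = ToyInterface(num_queries=4, video_dim=16, llm_dim=32)
video_tokens = interface(video_feats)

# The language model would consume video tokens followed by text tokens.
llm_input = video_tokens + text_tokens
print(len(llm_input), len(llm_input[0]))  # 9 32
```

In a trained system the query vectors and projection would be optimized on the instruction data while the video encoder and language model supply the features and consume the tokens; whether those two components are frozen or tuned is not stated in this page.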

Code & Models

Repositories

opengvlab/ask-anything
PyTorch · Official

Datasets

OpenGVLab/VideoChat2-IT
dataset · 400 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Topic Modeling