VideoChat: Chat-Centric Video Understanding
KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, Yu Qiao

TL;DR
VideoChat is an innovative end-to-end system that combines video foundation models and large language models through a learnable interface, enabling advanced spatiotemporal reasoning, event localization, and causal inference in videos.
Contribution
The paper introduces a novel chat-centric video understanding system and a comprehensive video instruction dataset, advancing the integration of vision and language models for video analysis.
Findings
Preliminary qualitative experiments suggest the system can handle a range of video understanding tasks.
The dataset emphasizes spatiotemporal reasoning and causal relationships.
The system demonstrates potential across diverse video applications.
Abstract
In this paper, we initiate an attempt of developing an end-to-end chat-centric video understanding system, coined as VideoChat. It integrates video foundation models and large language models via a learnable neural interface, excelling in spatiotemporal reasoning, event localization, and causal relationship inference. To instructively tune this system, we build a video-centric instruction dataset, composed of thousands of videos associated with detailed descriptions and conversations. This dataset emphasizes spatiotemporal reasoning and captures causal relationships, providing a valuable asset for training our chat-centric video understanding system. Preliminary qualitative experiments demonstrate the potential of our system across a broad spectrum of video applications, which could serve as a simple prototype system for future research on chat-centric video understanding. Access our…
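The abstract describes a learnable neural interface between a video foundation model and a large language model, but it does not spell out the design. The sketch below is one hypothetical way such an interface could look: a small set of query vectors cross-attends to video features, and the result is projected into the language model's embedding space. The class name `QueryInterface`, all dimensions, and the random initialization are assumptions for illustration, not the authors' implementation. In a real system these parameters would be trained; here they are fixed random arrays.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class QueryInterface:
    """Hypothetical interface: queries cross-attend to video features.

    Assumption: this mirrors a generic query-based design, not the
    specific module used in VideoChat.
    """

    def __init__(self, num_queries, video_dim, llm_dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 0.02
        # Parameters that would be learned during training.
        self.queries = rng.normal(0.0, scale, (num_queries, video_dim))
        self.w_k = rng.normal(0.0, scale, (video_dim, video_dim))
        self.w_v = rng.normal(0.0, scale, (video_dim, video_dim))
        self.w_out = rng.normal(0.0, scale, (video_dim, llm_dim))

    def attention(self, video_feats):
        """Attention weights of each query over all video tokens."""
        k = video_feats @ self.w_k
        scores = self.queries @ k.T / np.sqrt(k.shape[-1])
        return softmax(scores, axis=-1)

    def __call__(self, video_feats):
        """Map (num_video_tokens, video_dim) to (num_queries, llm_dim)."""
        attn = self.attention(video_feats)
        v = video_feats @ self.w_v
        pooled = attn @ v
        return pooled @ self.w_out


# Stand-in for video encoder output: 8 frames x 16 patches, 64-dim features.
video_feats = np.random.default_rng(1).normal(size=(8 * 16, 64))
interface = QueryInterface(num_queries=4, video_dim=64, llm_dim=128)
video_tokens = interface(video_feats)
print(video_tokens.shape)  # (4, 128)
```

The point of this kind of design is that the number of tokens handed to the language model is fixed by the number of queries, regardless of how many frames or patches the video encoder produces.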
Code & Models
- 🤗 OpenGVLab/VideoChat2_stage2_Mistral_7B · model · ♡ 2
- 🤗 OpenGVLab/VideoChat2_stage3_Mistral_7B · model · ♡ 4
- 🤗 OpenGVLab/VideoChat2_stage3_Phi3 · model
- 🤗 OpenGVLab/VideoChat2_stage2_Phi3 · model
- 🤗 OpenGVLab/VideoChat2_HD_stage4_Mistral_7B · model · ♡ 1
- 🤗 Andy1621/VideoChat2_VicunaV0_7B_stage3_noLoRA · model
- 🤗 OpenGVLab/InternVideo2-Chat-8B · model · 325 dl · ♡ 26
- 🤗 OpenGVLab/InternVideo2_chat_8B_HD · model · ♡ 18
- 🤗 OpenGVLab/InternVideo2_Chat_8B_InternLM2_5 · model · 35 dl · ♡ 8
- 🤗 OpenGVLab/VideoChat2_HD_stage4_Mistral_7B_hf · model · 52 dl · ♡ 3
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Topic Modeling
