InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

Yi Wang; Yinan He; Yizhuo Li; Kunchang Li; Jiashuo Yu; Xin Ma; Xinhao Li; Guo Chen; Xinyuan Chen; Yaohui Wang; Conghui He; Ping Luo; Ziwei Liu; Yali Wang; Limin Wang; Yu Qiao

arXiv:2307.06942 · cs.CV · January 5, 2024 · 32 cites

TL;DR

InternVid is a massive video-text dataset created using large language models, enabling advanced multimodal understanding and generation, and supporting diverse applications like video recognition, retrieval, and dialogue systems.

Contribution

The paper presents a scalable method to autonomously build a large-scale video-text dataset using LLMs and introduces ViCLIP, a new video-text representation model trained on InternVid.

Findings

01 ViCLIP achieves state-of-the-art zero-shot action recognition.

02 InternVid contains over 7 million videos and 4.1 billion words of descriptions.

03 The dataset and model support diverse multimodal video understanding tasks.
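
As a rough illustration of what zero-shot action recognition with a video-text embedding model involves, the sketch below picks the action label whose text embedding is closest (by cosine similarity) to a video embedding. It is not the ViCLIP implementation: the function names, the label set, and the 3-dimensional toy vectors are all hypothetical stand-ins for the outputs of real video and text encoders.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_classify(video_emb, label_embs, labels):
    """Return the label whose text embedding best matches the video embedding.

    video_emb:  shape (d,)   -- assumed output of a video encoder
    label_embs: shape (k, d) -- assumed text-encoder outputs, one row per label
    """
    v = l2_normalize(np.asarray(video_emb, dtype=np.float64))
    t = l2_normalize(np.asarray(label_embs, dtype=np.float64))
    scores = t @ v  # cosine similarity between the video and each label
    return labels[int(np.argmax(scores))], scores


# Toy stand-ins for encoder outputs (hypothetical values, not real embeddings).
labels = ["playing guitar", "riding a bike", "cooking"]
label_embs = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])
video_emb = np.array([0.2, 0.9, 0.1])

predicted, scores = zero_shot_classify(video_emb, label_embs, labels)
print(predicted)
```

No classifier is trained on the target labels here; the label names themselves act as the classifier, which is what makes the setting "zero-shot".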

Abstract

This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations for multimodal understanding and generation. The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words. Our core contribution is to develop a scalable approach to autonomously build a high-quality video-text dataset with large language models (LLMs), thereby showcasing its efficacy in learning video-language representation at scale. Specifically, we utilize a multi-scale approach to generate video-related descriptions. Furthermore, we introduce ViCLIP, a video-text representation learning model based on ViT-L. Learned on InternVid via contrastive learning, this model demonstrates leading zero-shot action recognition and competitive…
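
The abstract says ViCLIP is learned via contrastive learning but gives no further detail here. The sketch below shows a generic symmetric video-text contrastive (InfoNCE-style) loss of the kind used by CLIP-like models; it is an assumption about the general technique, not the paper's exact objective. The function names, the temperature of 0.07, and the random toy embeddings are hypothetical.

```python
import numpy as np


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Average of video-to-text and text-to-video cross-entropy losses.

    Row i of video_embs and row i of text_embs are assumed to be a matched
    pair; every other row in the batch serves as a negative.
    """
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (v @ t.T) / temperature  # (batch, batch) scaled cosine similarities
    idx = np.arange(logits.shape[0])
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # each video picks its text
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # each text picks its video
    return 0.5 * (loss_v2t + loss_t2v)


# Toy batch: 4 pairs of 8-dimensional embeddings (hypothetical values).
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
aligned = symmetric_contrastive_loss(text + 0.01 * rng.normal(size=(4, 8)), text)
shuffled = symmetric_contrastive_loss(text[::-1].copy(), text)
print(aligned < shuffled)
```

The loss is low when each video embedding is closest to its own caption and high when the pairing is scrambled, which is the signal that pulls matched video-text pairs together during training.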

Code & Models

Repositories

opengvlab/internvideo (PyTorch, official)

Videos

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization