InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, Limin Wang, Yu Qiao

TL;DR
InternVideo introduces a unified video foundation model that combines generative and discriminative self-supervised learning to achieve state-of-the-art results across diverse video understanding tasks and datasets.
Contribution
The paper combines masked video modeling (a generative objective) with video-language contrastive learning (a discriminative objective) and coordinates the two resulting video representations in a learnable manner.
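To make the masked video modeling idea concrete, here is a minimal sketch of one common masking scheme, "tube" masking, in which the same spatial patches are hidden in every frame and a model is trained to reconstruct them. This summary does not say which masking scheme or ratio InternVideo uses, so the function name `tube_mask`, the 90% ratio, and the 8 x 196 token grid are illustrative assumptions, not the authors' configuration.

```python
import random


def tube_mask(num_frames, patches_per_frame, mask_ratio, seed=None):
    """Return a num_frames x patches_per_frame grid of booleans (True = masked).

    Hypothetical helper: the same spatial patch indices are masked in every
    frame, so a masked patch cannot be recovered by copying it from a
    neighbouring frame.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    num_masked = int(round(mask_ratio * patches_per_frame))
    masked = set(rng.sample(range(patches_per_frame), num_masked))
    row = [p in masked for p in range(patches_per_frame)]
    # Repeat the same spatial mask along the time axis.
    return [list(row) for _ in range(num_frames)]


# Assumed example sizes: 8 frames, 14 x 14 = 196 patches per frame.
mask = tube_mask(num_frames=8, patches_per_frame=196, mask_ratio=0.9, seed=0)
visible_per_frame = sum(not m for m in mask[0])
print(visible_per_frame)  # 196 - 176 = 20 visible patches per frame
```

Only the visible patches would be fed to the encoder in such a setup; the reconstruction target is the content of the masked ones.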
Findings
Achieves 91.1% top-1 accuracy on Kinetics-400
Achieves 77.2% top-1 accuracy on Something-Something V2
Sets new state-of-the-art across 39 video datasets
Abstract
The foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models simply focus on image-level pretraining and adaptation, which are limited for dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action…
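The video-language contrastive objective named in the abstract can be illustrated with a symmetric InfoNCE loss, the formulation popularized by CLIP-style training. The sketch below is a generic version under stated assumptions: the function names, the temperature of 0.07, and the toy 2-dimensional embeddings are all hypothetical, and the exact loss used by InternVideo may differ.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _diag_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired video/text embeddings.

    Pair i (video_emb[i], text_emb[i]) is the positive; every other
    combination in the batch acts as a negative.
    """
    v = [_normalize(x) for x in video_emb]
    t = [_normalize(x) for x in text_emb]
    # Cosine similarities scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
              for vi in v]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-video direction
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits_t))


# Toy batch: correctly paired embeddings versus deliberately swapped ones.
matched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(matched < swapped)  # True
```

The loss is low when each video embedding is closest to its own caption and high when the pairs are mismatched, which is what pulls the two modalities into a shared space.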
Code & Models
- 🤗 OpenGVLab/ViCLIP (model) · ♡ 48
- 🤗 OpenGVLab/InternVideo2-Stage2_1B-224p-f4 (model) · ♡ 22
- 🤗 OpenGVLab/InternVideo2-CLIP-1B-224p-f8 (model) · ♡ 5
- 🤗 OpenGVLab/InternVideo2-Stage1-1B-224p-f8 (model) · ♡ 6
- 🤗 OpenGVLab/InternVideo2-Stage1-1B-224p-f8-k710 (model) · ♡ 1
- 🤗 OpenGVLab/InternVideo2-Stage1-1B-224p-K400 (model) · ♡ 4
- 🤗 OpenGVLab/InternVideo2-Stage1-1B-224p-K600 (model) · ♡ 1
- 🤗 OpenGVLab/InternVideo2-Stage1-1B-224p-K700 (model) · ♡ 1
- 🤗 OpenGVLab/InternVideo2-Stage1-1B-224p-f8-SthSth (model)
- 🤗 OpenGVLab/InternVideo2-Stage1-1B-224p-f8-MiT (model) · ♡ 2
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Learning · InternVideo: General Video Foundation Models via Generative and Discriminative Learning
