Harvest Video Foundation Models via Efficient Post-Pretraining
Yizhuo Li, Kunchang Li, Yinan He, Yi Wang, Yali Wang, Limin Wang, Yu Qiao, Ping Luo

TL;DR
This paper introduces an efficient post-pretraining method that converts image foundation models into video-language models through two simple input modifications, random video patch dropping and text masking. The method cuts training cost substantially while reaching state-of-the-art results on a range of video-language tasks.
Contribution
The authors propose a novel, simple post-pretraining framework that leverages patch dropping and text masking to efficiently adapt image models for video tasks, reducing training time and data requirements.
Findings
Achieves state-of-the-art performance on multiple video-language benchmarks.
Training can be completed in less than one day on 8 GPUs.
Requires only WebVid-10M data for pretraining.
Abstract
Building video-language foundation models is costly and difficult due to the redundant nature of video data and the lack of high-quality video-language datasets. In this paper, we propose an efficient framework to harvest video foundation models from image ones. Our method is intuitively simple by randomly dropping input video patches and masking out input text during the post-pretraining procedure. The patch dropping boosts the training efficiency significantly and text masking enforces the learning of cross-modal fusion. We conduct extensive experiments to validate the effectiveness of our method on a wide range of video-language downstream tasks including various zero-shot tasks, video question answering, and video-text retrieval. Despite its simplicity, our method achieves state-of-the-art performances, which are comparable to some heavily pretrained video foundation models. Our…
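The two input modifications named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the keep and mask ratios, the clip shape, and the token ids are all assumptions chosen for illustration, and pure-Python lists stand in for real tensors.

```python
import random


def drop_patches(patch_tokens, keep_ratio, rng):
    """Keep a random subset of video patch tokens, preserving their order."""
    num_keep = max(1, int(len(patch_tokens) * keep_ratio))
    kept_idx = sorted(rng.sample(range(len(patch_tokens)), num_keep))
    return [patch_tokens[i] for i in kept_idx], kept_idx


def mask_text(token_ids, mask_ratio, mask_id, rng):
    """Replace a random subset of text tokens with mask_id.

    Returns the masked sequence and per-position labels: the original
    token at masked positions, None elsewhere.
    """
    num_mask = max(1, int(len(token_ids) * mask_ratio))
    masked_pos = set(rng.sample(range(len(token_ids)), num_mask))
    masked = [mask_id if i in masked_pos else t for i, t in enumerate(token_ids)]
    labels = [t if i in masked_pos else None for i, t in enumerate(token_ids)]
    return masked, labels


rng = random.Random(0)

# Hypothetical clip: 4 frames of 14 x 14 patches, each patch shown as its index.
patches = list(range(4 * 14 * 14))
# keep_ratio=0.3 is an assumed value, not one taken from the paper.
kept, kept_idx = drop_patches(patches, keep_ratio=0.3, rng=rng)

# Hypothetical caption token ids; mask_id=103 and mask_ratio=0.5 are assumptions.
text = [101, 2023, 2003, 1037, 2678, 102]
masked, labels = mask_text(text, mask_ratio=0.5, mask_id=103, rng=rng)

print(len(patches), "->", len(kept), "patch tokens fed to the video encoder")
print("masked text:", masked)
```

Because the encoder only sees the kept tokens, the sequence it processes is shorter, which is where the efficiency gain described above comes from. The masked text positions give the model a prediction target that requires using the video input, in line with the abstract's remark that text masking enforces cross-modal fusion.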
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
Methods: InternVideo: General Video Foundation Models via Generative and Discriminative Learning
