CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment
Hongwei Xue, Yuchong Sun, Bei Liu, Jianlong Fu, Ruihua Song, Houqiang Li, Jiebo Luo

TL;DR
This paper introduces CLIP-ViP, a novel method that adapts pre-trained CLIP for video-language tasks by addressing domain gap and data scale issues, achieving state-of-the-art results on multiple datasets.
Contribution
Proposes CLIP-ViP, a new approach for effective post-pretraining adaptation of CLIP to video-language tasks, with a Video Proxy mechanism and cross-modal learning.
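The cross-modal learning mentioned above builds on CLIP-style contrastive training. As a rough illustration only, here is a minimal NumPy sketch of a generic symmetric contrastive (InfoNCE) loss over a batch of paired video and text embeddings. It is not the paper's Omnisource Cross-modal Learning objective or its Video Proxy mechanism; the function names, the temperature value, and the assumption that row i of each matrix forms a matching pair are all hypothetical choices made for this example.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.05):
    """Generic CLIP-style loss; assumes row i of each input is a matching pair.

    The temperature of 0.05 is an arbitrary illustrative value, not one
    taken from the paper.
    """
    v = l2_normalize(np.asarray(video_emb, dtype=float))
    t = l2_normalize(np.asarray(text_emb, dtype=float))
    logits = v @ t.T / temperature  # shape (batch, batch)
    idx = np.arange(logits.shape[0])
    # Video-to-text: each row should put its probability mass on the diagonal.
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-video: the same requirement applied column-wise.
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()
    return float(0.5 * (loss_v2t + loss_t2v))
```

With correctly paired embeddings the loss approaches zero, and it grows when the pairing is shuffled, which is the signal that pulls matching video and text representations together during training.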
Findings
Significant improvement in video-text retrieval performance.
Achieves SOTA results on MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
Highlights the importance of data scale and domain gap mitigation.
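This page does not say which metric the retrieval findings are measured with. Benchmarks of this kind are commonly reported as Recall@K, so the sketch below shows how that number is computed from a query-by-candidate similarity matrix. The function name and the convention that query i's correct video sits at index i are assumptions made for illustration.

```python
import numpy as np


def recall_at_k(similarity, k):
    """Fraction of queries whose correct item is ranked within the top k.

    similarity[i, j] is the score of text query i against video j.
    Assumes the ground-truth video for query i is at index i.
    """
    sim = np.asarray(similarity, dtype=float)
    correct = np.diag(sim)[:, None]  # score of the true match for each query
    # Zero-based rank: how many candidates score strictly higher than the match.
    # Ties are therefore resolved in favour of the correct item.
    ranks = (sim > correct).sum(axis=1)
    return float((ranks < k).mean())


# Toy example with three text queries and three candidate videos.
toy_scores = [
    [0.9, 0.1, 0.0],  # correct video ranked first
    [0.8, 0.5, 0.1],  # correct video ranked second
    [0.2, 0.3, 0.1],  # correct video ranked third
]
r_at_1 = recall_at_k(toy_scores, 1)
r_at_2 = recall_at_k(toy_scores, 2)
```

In the toy example only the first query retrieves its video at rank one, so Recall@1 is one third while Recall@2 is two thirds.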
Abstract
Pre-trained image-text models, like CLIP, have demonstrated the strong power of vision-language representation learned from a large scale of web-collected image-text data. In light of the well-learned visual features, some existing works transfer image representation to the video domain and achieve good results. However, how to utilize an image-language pre-trained model (e.g., CLIP) for video-language pre-training (post-pretraining) is still underexplored. In this paper, we investigate two questions: 1) what are the factors hindering post-pretraining CLIP from further improving performance on video-language tasks? and 2) how to mitigate the impact of these factors? Through a series of comparative experiments and analyses, we find that the data scale and the domain gap between language sources have great impacts. Motivated by these, we propose an Omnisource Cross-modal Learning method equipped…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Cancer-related molecular mechanisms research
Methods: Contrastive Language-Image Pre-training
