CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

Hongwei Xue; Yuchong Sun; Bei Liu; Jianlong Fu; Ruihua Song; Houqiang Li; Jiebo Luo

arXiv:2209.06430 · cs.CV · March 3, 2023 · 53 cites

TL;DR

This paper introduces CLIP-ViP, a novel method that adapts pre-trained CLIP for video-language tasks by addressing domain gap and data scale issues, achieving state-of-the-art results on multiple datasets.

Contribution

Proposes CLIP-ViP, a new approach for effective post-pretraining adaptation of CLIP to video-language tasks, with a Video Proxy mechanism and cross-modal learning.

Findings

01 Significant improvement in video-text retrieval performance.

02 Achieves SOTA results on MSR-VTT, DiDeMo, LSMDC, and ActivityNet.

03 Highlights the importance of data scale and domain gap mitigation.

Abstract

Pre-trained image-text models such as CLIP have demonstrated the power of vision-language representations learned from large-scale, web-collected image-text data. Building on these well-learned visual features, some existing works transfer image representations to the video domain and achieve good results. However, how to use an image-language pre-trained model (e.g., CLIP) for video-language pre-training (post-pretraining) is still under-explored. In this paper, we investigate two questions: 1) what factors hinder post-pretraining CLIP from further improving performance on video-language tasks? and 2) how can the impact of these factors be mitigated? Through a series of comparative experiments and analyses, we find that the data scale and the domain gap between language sources have a great impact. Motivated by these findings, we propose an Omnisource Cross-modal Learning method equipped…

Code & Models

Repositories

microsoft/xpretrain
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Cancer-related molecular mechanisms research

Methods: Contrastive Language-Image Pre-training