CLIP2Video: Mastering Video-Text Retrieval via Image CLIP
Han Fang, Pengfei Xiong, Luhui Xu, Yu Chen

TL;DR
CLIP2Video leverages a pretrained image-language model to improve video-text retrieval by incorporating temporal dynamics and alignment, achieving state-of-the-art results on multiple benchmarks with a simplified, efficient framework.
Contribution
The paper introduces a two-stage video-text retrieval framework that adapts CLIP to video by adding two temporal modules, a Temporal Difference Block and a Temporal Alignment Block, so that the model can be trained on comparatively small datasets.
Findings
Achieves new state-of-the-art retrieval accuracy on MSR-VTT, MSVD, and VATEX datasets.
Demonstrates the effectiveness of temporal modules in enhancing video-text alignment.
Validates the approach through comprehensive ablation studies.
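Retrieval accuracy on benchmarks of this kind is commonly reported as Recall@K: the fraction of text queries whose correct video appears among the top K ranked results. The summary above does not say which metrics were used, so the following is only a minimal sketch of that common metric. The function name `recall_at_k`, the toy similarity matrix, and the assumption that query i matches video i are all illustrative and not taken from the paper.

```python
def recall_at_k(sim, k):
    """Fraction of queries whose correct item ranks in the top k.

    sim[i][j] is the similarity between text query i and video j.
    Assumption (illustrative): the correct video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank of the correct video = number of videos scoring strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(sim)


# Toy 3x3 similarity matrix (made-up numbers, not results from the paper).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]

r1 = recall_at_k(sim, 1)
r2 = recall_at_k(sim, 2)
```

In this toy case the second query ranks a wrong video first, so it counts as a miss at K=1 and a hit at K=2.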
Abstract
We present the CLIP2Video network, which transfers an image-language pre-training model to video-text retrieval in an end-to-end manner. Leading approaches in video-and-language learning try to distill spatio-temporal video features and the multi-modal interaction between videos and language from a large-scale video-text dataset. In contrast, we leverage a pretrained image-language model and simplify it into a two-stage framework, with co-learning of image-text and enhancement of the temporal relations between video frames and video-text respectively, which makes it possible to train on comparatively small datasets. Specifically, based on the spatial semantics captured by the Contrastive Language-Image Pretraining (CLIP) model, our model uses a Temporal Difference Block to capture motion at fine temporal video frames, and a Temporal Alignment Block to re-align the tokens of video clips and phrases…
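The abstract excerpt is truncated and does not specify how the Temporal Difference Block or the Temporal Alignment Block is built. As a rough intuition only, the sketch below shows two generic ingredients that a frame-based retrieval pipeline of this kind could use: differences between adjacent frame embeddings as a simple motion cue, and cosine similarity between a pooled video embedding and a text embedding. It is not the authors' architecture. The helper names (`frame_differences`, `mean_pool`, `cosine`) and the 2-dimensional toy embeddings are hypothetical stand-ins for real CLIP features.

```python
import math


def frame_differences(frames):
    """Element-wise differences between adjacent frame embeddings.

    A simple motion cue; a stand-in, not the paper's Temporal Difference Block.
    """
    return [
        [b - a for a, b in zip(prev, nxt)]
        for prev, nxt in zip(frames, frames[1:])
    ]


def mean_pool(vectors):
    """Average a list of equal-length vectors into a single vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy per-frame embeddings (hypothetical; real CLIP features are much larger).
frames = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
text = [0.0, 1.0]

diffs = frame_differences(frames)  # one difference vector per adjacent frame pair
video = mean_pool(frames)          # single video-level embedding
score = cosine(video, text)        # video-text similarity used for ranking
```

Here the first difference vector is non-zero (the content changes between frames 1 and 2) while the second is all zeros (frames 2 and 3 are identical), which is the kind of signal that mean pooling alone discards.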
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Natural Language Processing Techniques
