CLIP2Video: Mastering Video-Text Retrieval via Image CLIP

Han Fang; Pengfei Xiong; Luhui Xu; Yu Chen

arXiv:2106.11097 · cs.CV · June 22, 2021 · 130 cites

TL;DR

CLIP2Video leverages a pretrained image-language model to improve video-text retrieval by incorporating temporal dynamics and alignment, achieving state-of-the-art results on multiple benchmarks with a simplified, efficient framework.

Contribution

The paper introduces a two-stage video-text retrieval framework that adapts CLIP to video by adding two temporal modules, a Temporal Difference Block and a Temporal Alignment Block, so that the model can be trained on comparatively small datasets.

Findings

01. Achieves new state-of-the-art retrieval accuracy on MSR-VTT, MSVD, and VATEX datasets.

02. Demonstrates the effectiveness of temporal modules in enhancing video-text alignment.

03. Validates the approach through comprehensive ablation studies.
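
Retrieval accuracy on benchmarks such as these is commonly summarized with Recall@K, the fraction of text queries whose matching video appears among the top K results. This page does not say which metrics the paper reports, so the sketch below is only an assumption about how such a score could be computed. The function name `recall_at_k` and the toy similarity matrix are hypothetical.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose correct item is ranked in the top k.

    Assumption: similarity[i][j] is the score of text query i against
    video j, and the correct video for query i has index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Indices of the k highest-scoring videos for this query.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if i in top_k:
            hits += 1
    return hits / len(similarity)


# Toy scores for 3 text queries against 3 videos (made-up numbers).
sim = [
    [0.9, 0.1, 0.2],  # query 0: correct video 0 is ranked first
    [0.3, 0.2, 0.8],  # query 1: correct video 1 is ranked last
    [0.1, 0.4, 0.7],  # query 2: correct video 2 is ranked first
]

r_at_1 = recall_at_k(sim, 1)
r_at_3 = recall_at_k(sim, 3)
```

With these toy scores, two of the three queries rank their correct video first, and all three find it within the top three.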

Abstract

We present the CLIP2Video network, which transfers an image-language pre-training model to video-text retrieval in an end-to-end manner. Leading approaches in video-and-language learning try to distill spatio-temporal video features and multi-modal interactions between videos and language from a large-scale video-text dataset. In contrast, we leverage a pretrained image-language model and simplify it into a two-stage framework, with co-learning of image-text and enhancement of temporal relations between video frames and between video and text, respectively, which makes it trainable on comparatively small datasets. Specifically, based on the spatial semantics captured by the Contrastive Language-Image Pretraining (CLIP) model, our model involves a Temporal Difference Block to capture motions at fine temporal video frames, and a Temporal Alignment Block to re-align the tokens of video clips and phrases…
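
As a rough illustration of the ideas in the abstract, the sketch below takes per-frame embeddings (of the kind an image encoder such as CLIP would produce), computes differences between adjacent frames as a crude motion cue, mean-pools the frames into one video vector, and ranks videos against a text embedding by cosine similarity. It is a toy, pure-Python sketch. The function names, the mean pooling, and the hand-written vectors are all assumptions for illustration. It is not the paper's Temporal Difference Block or Temporal Alignment Block, which are learned modules.

```python
import math


def frame_differences(frames):
    """Differences between adjacent frame embeddings (a crude motion cue)."""
    return [[b - a for a, b in zip(prev, cur)]
            for prev, cur in zip(frames, frames[1:])]


def mean_pool(vectors):
    """Average a list of equal-length vectors into a single vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_videos(text_emb, videos):
    """Return video names sorted by similarity to the text embedding."""
    scores = {name: cosine(text_emb, mean_pool(frames))
              for name, frames in videos.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical 2-D frame embeddings; real CLIP embeddings have hundreds of dims.
videos = {
    "static": [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],
    "moving": [[0.0, 1.0], [0.5, 1.0], [1.0, 1.0]],
}
text_emb = [0.1, 1.0]  # hypothetical text query embedding

ranking = rank_videos(text_emb, videos)
motion = frame_differences(videos["moving"])
```

Here the "static" clip yields all-zero frame differences while the "moving" clip does not, which is the kind of signal a temporal module could exploit. The mean-pooled similarity alone ignores frame order.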

Code & Models

Repositories

CryhanFang/CLIP2Video (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Natural Language Processing Techniques