CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval

Huaishao Luo; Lei Ji; Ming Zhong; Yang Chen; Wen Lei; Nan Duan; Tianrui Li

arXiv:2104.08860·cs.CV·May 11, 2021·113 cites

TL;DR

This paper explores how to adapt the CLIP model for end-to-end video clip retrieval, investigating the effectiveness of image features, the impact of large-scale video-text pretraining, temporal modeling mechanisms, and hyper-parameter sensitivity, achieving state-of-the-art results.

Contribution

The paper introduces CLIP4Clip, a novel approach transferring CLIP knowledge to video-text retrieval, with comprehensive empirical studies on model design and training strategies.

Findings

01. Image features alone are insufficient for optimal video-text retrieval.

02. Large-scale pretraining improves retrieval performance significantly.

03. Temporal dependency modeling enhances video understanding.

Abstract

Video-text retrieval plays an essential role in multi-modal research and has been widely used in many real-world web applications. CLIP (Contrastive Language-Image Pre-training), an image-language pre-training model, has demonstrated the power of learning visual concepts from web-collected image-text datasets. In this paper, we propose a CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner. Several questions are investigated via empirical studies: 1) Are image features enough for video-text retrieval? 2) How does post-pretraining on a large-scale video-text dataset based on CLIP affect performance? 3) What is a practical mechanism to model temporal dependency between video frames? 4) How sensitive is the model to hyper-parameters on the video-text retrieval task? Extensive experimental results present that the…
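To make the third question concrete, here is a minimal sketch of the simplest way to turn per-frame image embeddings into a video embedding for retrieval: average the frames, then rank videos by cosine similarity to the text embedding. This is an illustration under stated assumptions, not the authors' implementation. The helper names (`mean_pool`, `rank_videos`) and the toy 2-D embeddings are hypothetical, and real CLIP embeddings would come from the pretrained encoders. Mean pooling ignores frame order, which is why it serves as a baseline with no temporal modeling.

```python
import math

def mean_pool(frames):
    """Average per-frame embeddings into a single video embedding."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dim)]

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def rank_videos(text_emb, videos):
    """Return video indices sorted by descending similarity to the text."""
    scores = [cosine(text_emb, mean_pool(frames)) for frames in videos]
    return sorted(range(len(videos)), key=lambda i: scores[i], reverse=True)

# Toy data (made-up numbers): two videos, three 2-D frame embeddings each.
videos = [
    [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]],  # mostly along the first axis
    [[0.0, 1.0], [0.1, 0.9], [0.2, 0.8]],  # mostly along the second axis
]
text_emb = [0.1, 1.0]  # hypothetical text embedding, close to the second video

ranking = rank_videos(text_emb, videos)
print(ranking)
```

Because the average is order-invariant, shuffling the frames of a video leaves its score unchanged; a mechanism that models temporal dependency would replace `mean_pool` with something sensitive to frame order.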

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Language-Image Pre-training