CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, Tianrui Li

TL;DR
This paper explores how to adapt the CLIP model for end-to-end video clip retrieval, investigating the effectiveness of image features, the impact of large-scale video-text pretraining, temporal modeling mechanisms, and hyper-parameter sensitivity, achieving state-of-the-art results.
Contribution
The paper introduces CLIP4Clip, a novel approach transferring CLIP knowledge to video-text retrieval, with comprehensive empirical studies on model design and training strategies.
Findings
Image features alone are insufficient for optimal video-text retrieval.
Post-pretraining CLIP on a large-scale video-text dataset improves retrieval performance.
Temporal dependency modeling enhances video understanding.
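To make the findings above concrete, here is a minimal sketch of frame-level retrieval with the simplest possible aggregation: averaging per-frame embeddings into one video embedding and ranking videos by cosine similarity to a text embedding. It is an illustration under stated assumptions, not the authors' implementation. The function names (mean_pool_frames, rank_videos) and the 3-dimensional toy embeddings are hypothetical; in practice the embeddings would come from CLIP's image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mean_pool_frames(frame_embeddings):
    """Average per-frame embeddings into a single video embedding.

    Averaging discards frame order, so this baseline carries no
    temporal information.
    """
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(frame[d] for frame in frame_embeddings) / n for d in range(dim)]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def rank_videos(text_embedding, videos):
    """Return video names sorted from most to least similar to the text.

    `videos` maps a video name to its list of frame embeddings.
    """
    scores = {
        name: cosine_similarity(text_embedding, mean_pool_frames(frames))
        for name, frames in videos.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: made-up 3-d embeddings standing in for encoder outputs.
videos = {
    "cooking": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "soccer": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]],
}
query = [0.9, 0.1, 0.0]
ranking = rank_videos(query, videos)
```

Because the average is the same for any ordering of the frames, this baseline cannot tell apart two videos that contain the same frames in a different order. That gap is what a temporal modeling mechanism is meant to close.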
Abstract
Video-text retrieval plays an essential role in multi-modal research and has been widely used in many real-world web applications. The CLIP (Contrastive Language-Image Pre-training), an image-language pre-training model, has demonstrated the power of visual concepts learning from web collected image-text datasets. In this paper, we propose a CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner. Several questions are investigated via empirical studies: 1) Whether image feature is enough for video-text retrieval? 2) How a post-pretraining on a large-scale video-text dataset based on the CLIP affect the performance? 3) What is the practical mechanism to model temporal dependency between video frames? And 4) The Hyper-parameters sensitivity of the model on video-text retrieval task. Extensive experimental results present that the…
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Language-Image Pre-training
