Masked Contrastive Pre-Training for Efficient Video-Text Retrieval
Fangxun Shu, Biaolong Chen, Yue Liao, Shuwen Xiao, Wenyu Sun, Xiaobo Li, Yousong Zhu, Jinqiao Wang, Si Liu

TL;DR
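This paper introduces MAC, a masked contrastive pre-training framework for video-text retrieval. It cuts pre-training cost by masking redundant spatial regions of the video (and tokens of the text) and by aligning the two modalities at a high level instead of reconstructing the masked content, which leads to faster training and state-of-the-art results.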
Contribution
The paper proposes a masked-then-alignment paradigm for video-text pre-training, which replaces the mask-then-prediction paradigm of MAE and improves both efficiency and retrieval performance.
Findings
Reduces FLOPs by 60%
Accelerates pre-training by 3x
Achieves state-of-the-art results on multiple datasets
Abstract
We present a simple yet effective end-to-end Video-language Pre-training (VidLP) framework, Masked Contrastive Video-language Pretraining (MAC), for video-text retrieval tasks. Our MAC aims to reduce the spatial and temporal redundancy of video representations in the VidLP model through a mask sampling mechanism, improving pre-training efficiency. Compared with conventional temporal sparse sampling, we propose to randomly mask a high ratio of spatial regions and feed only the visible regions into the encoder as sparse spatial sampling. Similarly, we adopt the mask sampling technique for text inputs for consistency. Instead of blindly applying the mask-then-prediction paradigm from MAE, we propose a masked-then-alignment paradigm for efficient video-text alignment. The motivation is that video-text retrieval tasks rely on high-level alignment rather than low-level reconstruction, and multimodal alignment…
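The masked-then-alignment idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the mask ratios (0.75 for video, 0.5 for text), the mean-pooling stand-in for the encoders, the symmetric InfoNCE loss, and every function name below are assumptions chosen for illustration. The sketch only shows the flow the abstract describes: drop a random subset of tokens, encode the visible ones, and align the resulting video and text embeddings contrastively rather than reconstructing the masked content.

```python
import math
import random


def random_mask_keep(tokens, mask_ratio, rng):
    """Randomly drop `mask_ratio` of the tokens; return (visible_tokens, kept_indices)."""
    n_keep = max(1, round(len(tokens) * (1.0 - mask_ratio)))
    kept = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept], kept


def mean_pool(vectors):
    """Hypothetical stand-in for an encoder: average the visible token vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def symmetric_info_nce(video_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings (assumed loss)."""
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    n = len(v)
    # sim[i][j]: scaled cosine similarity between video i and text j.
    sim = [[sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
           for i in range(n)]

    def cross_entropy_on_diagonal(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum_exp - row[i]
        return total / len(rows)

    text_to_video = [list(col) for col in zip(*sim)]
    return 0.5 * (cross_entropy_on_diagonal(sim) + cross_entropy_on_diagonal(text_to_video))


if __name__ == "__main__":
    rng = random.Random(0)
    batch, dim = 4, 8
    video_embs, text_embs = [], []
    for _ in range(batch):
        # Toy inputs: 16 video patch tokens and 6 text tokens per pair.
        patches = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(16)]
        words = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(6)]
        visible_patches, _ = random_mask_keep(patches, 0.75, rng)  # assumed ratio
        visible_words, _ = random_mask_keep(words, 0.5, rng)       # assumed ratio
        # Only the visible tokens reach the (stand-in) encoders.
        video_embs.append(mean_pool(visible_patches))
        text_embs.append(mean_pool(visible_words))
    print("contrastive loss:", symmetric_info_nce(video_embs, text_embs))
```

Because only the visible tokens are passed to the encoder, the encoder's input length shrinks with the mask ratio, which is the source of the compute saving the abstract refers to. The sketch has no decoder or reconstruction target, which is the difference from an MAE-style mask-then-prediction setup.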
Taxonomy
Topics: Multimodal Machine Learning Applications · Cancer-related molecular mechanisms research · Domain Adaptation and Few-Shot Learning
Methods: Masked autoencoder
