Masked Contrastive Pre-Training for Efficient Video-Text Retrieval

Fangxun Shu; Biaolong Chen; Yue Liao; Shuwen Xiao; Wenyu Sun; Xiaobo Li; Yousong Zhu; Jinqiao Wang; Si Liu

arXiv:2212.00986·cs.CV·December 6, 2022·5 cites

Open Access

TL;DR

This paper introduces MAC, a masked contrastive pre-training framework for video-text retrieval. It improves pre-training efficiency by masking redundant video and text inputs and by aligning the two modalities at a high level instead of reconstructing the masked content, which yields faster training and state-of-the-art results.

Contribution

The paper proposes a masked-then-alignment paradigm for video-text pre-training, in place of the mask-then-prediction paradigm of MAE, to improve both pre-training efficiency and retrieval performance.

Findings

01. Reduces FLOPs by 60%

02. Accelerates pre-training by 3x

03. Achieves state-of-the-art results on multiple datasets

Abstract

We present a simple yet effective end-to-end Video-language Pre-training (VidLP) framework, Masked Contrastive Video-language Pretraining (MAC), for video-text retrieval tasks. MAC aims to reduce the spatial and temporal redundancy of video representations in the VidLP model through a mask sampling mechanism, which improves pre-training efficiency. In addition to conventional temporal sparse sampling, we propose to randomly mask a high ratio of spatial regions and feed only the visible regions into the encoder, as a form of sparse spatial sampling. For consistency, we apply the same mask sampling technique to text inputs. Instead of blindly applying the mask-then-prediction paradigm from MAE, we propose a masked-then-alignment paradigm for efficient video-text alignment. The motivation is that video-text retrieval tasks rely on high-level alignment rather than low-level reconstruction, and multimodal alignment…
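The masked-then-alignment idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a mask ratio of 0.75 (the abstract only says "a high ratio"), mean pooling over visible tokens, and a symmetric InfoNCE-style contrastive loss with temperature 0.07. The function names `sample_visible`, `mean_pool`, and `symmetric_info_nce` are hypothetical.

```python
import math
import random


def sample_visible(num_tokens, mask_ratio, rng):
    """Return sorted indices of the tokens kept after random masking."""
    num_keep = max(1, round(num_tokens * (1.0 - mask_ratio)))
    return sorted(rng.sample(range(num_tokens), num_keep))


def mean_pool(tokens):
    """Average a list of equal-length feature vectors into one vector."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_diag(rows):
    """Mean cross-entropy where the target of row i is column i."""
    total = 0.0
    for i, row in enumerate(rows):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(rows)


def symmetric_info_nce(video_embs, text_embs, temperature=0.07):
    """Assumed alignment objective: pair i of video/text is the positive."""
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    sims = [
        [sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
        for vi in v
    ]
    sims_t = [list(col) for col in zip(*sims)]  # text-to-video direction
    return 0.5 * (cross_entropy_diag(sims) + cross_entropy_diag(sims_t))


# Usage: keep 25% of 16 toy patch features and pool only the visible ones.
rng = random.Random(0)
patches = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(16)]
keep = sample_visible(len(patches), mask_ratio=0.75, rng=rng)
visible = [patches[i] for i in keep]  # only these would reach an encoder
video_emb = mean_pool(visible)
```

In this sketch the masked tokens are simply dropped and never reconstructed: the only training signal is the alignment loss on pooled embeddings. That is one reading of why the abstract contrasts this paradigm with MAE-style prediction.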

Taxonomy

Topics: Multimodal Machine Learning Applications · Cancer-related molecular mechanisms research · Domain Adaptation and Few-Shot Learning

Methods: Masked autoencoder