MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval
Yuying Ge, Yixiao Ge, Xihui Liu, Alex Jinpeng Wang, Jianping Wu, Ying Shan, Xiaohu Qie, Ping Luo

TL;DR
This paper introduces MILES, a novel video-text pre-training method using masked visual modeling with injected language semantics in a dual-encoder architecture, improving local visual feature learning and cross-modal alignment.
Contribution
It pioneers masked visual modeling with injected language semantics in video-text dual-encoder pre-training, enhancing local feature discrimination and retrieval performance.
Findings
Outperforms state-of-the-art methods on four text-to-video retrieval datasets.
Significantly improves zero-shot action recognition accuracy.
Enhances local visual feature learning and cross-modal alignment.
Abstract
Dominant pre-training work for video-text retrieval mainly adopts "dual-encoder" architectures to enable efficient retrieval, where two separate encoders are used to contrast global video and text representations but ignore detailed local semantics. The recent success of image BERT pre-training with masked visual modeling, which promotes the learning of local visual context, motivates a possible solution to this limitation. In this work, we investigate, for the first time, masked visual modeling in video-text pre-training with the "dual-encoder" architecture. We perform Masked visual modeling with Injected LanguagE Semantics (MILES) by employing an extra snapshot video encoder as an evolving "tokenizer" to produce reconstruction targets for masked video patch prediction. Given the corrupted video, the video encoder is trained to recover text-aligned features of the masked…
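The two training signals the abstract describes can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the symmetric contrastive form, the squared-error regression on normalized features, the momentum-style snapshot update, and all function names and default values (`temperature`, `momentum`) are assumptions made for the example, since the abstract only says that global video and text representations are contrasted and that an evolving snapshot encoder supplies targets for masked patches.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(video_embs, text_embs, temperature=0.05):
    """Symmetric contrastive loss over paired global video/text embeddings.

    Assumption: pair i in video_embs matches pair i in text_embs, and every
    other item in the batch acts as a negative.
    """
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    sims = [[dot(vi, tj) / temperature for tj in t] for vi in v]

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]
        return total / len(rows)

    text_to_video = [list(col) for col in zip(*sims)]
    return 0.5 * (cross_entropy(sims) + cross_entropy(text_to_video))


def masked_feature_loss(predicted, targets, mask):
    """Mean squared error on normalized features, over masked patches only.

    `predicted` comes from the video encoder on the corrupted video; `targets`
    come from the snapshot encoder on the uncorrupted video (hypothetical
    inputs here, passed in as plain lists of per-patch feature vectors).
    """
    errors = []
    for p, t, is_masked in zip(predicted, targets, mask):
        if is_masked:
            p, t = l2_normalize(p), l2_normalize(t)
            errors.append(sum((a - b) ** 2 for a, b in zip(p, t)))
    return sum(errors) / len(errors) if errors else 0.0


def update_snapshot(snapshot_params, encoder_params, momentum=0.996):
    """Assumed momentum update that lets the snapshot 'tokenizer' evolve."""
    return [momentum * s + (1.0 - momentum) * e
            for s, e in zip(snapshot_params, encoder_params)]
```

In this sketch, only patches flagged in `mask` contribute to the reconstruction term, so the video encoder is pushed to infer the missing local content from context, while the contrastive term keeps the global video and text embeddings aligned.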
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Cancer-related molecular mechanisms research
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Residual Connection · Layer Normalization · Dense Connections · Attention Dropout · Softmax · WordPiece
