MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval

Yuying Ge; Yixiao Ge; Xihui Liu; Alex Jinpeng Wang; Jianping Wu; Ying Shan; Xiaohu Qie; Ping Luo

arXiv:2204.12408·cs.CV·April 27, 2022

TL;DR

This paper introduces MILES, a novel video-text pre-training method using masked visual modeling with injected language semantics in a dual-encoder architecture, improving local visual feature learning and cross-modal alignment.

Contribution

It pioneers masked visual modeling with injected language semantics in video-text dual-encoder pre-training, enhancing local feature discrimination and retrieval performance.

Findings

01 Outperforms prior state-of-the-art methods on text-to-video retrieval across four datasets.

02 Significantly improves zero-shot action recognition accuracy.

03 Enhances local visual feature learning and cross-modal alignment.

Abstract

Dominant pre-training work for video-text retrieval mainly adopts "dual-encoder" architectures to enable efficient retrieval, where two separate encoders are used to contrast global video and text representations, but ignores detailed local semantics. The recent success of image BERT pre-training with masked visual modeling, which promotes the learning of local visual context, motivates a possible solution to this limitation. In this work, we investigate, for the first time, masked visual modeling in video-text pre-training with the "dual-encoder" architecture. We perform Masked visual modeling with Injected LanguagE Semantics (MILES) by employing an extra snapshot video encoder as an evolving "tokenizer" to produce reconstruction targets for masked video patch prediction. Given the corrupted video, the video encoder is trained to recover text-aligned features of the masked…
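
The two training signals described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain Python lists instead of real encoders, and the function names, the squared-distance regression loss, the symmetric InfoNCE form, and the temperature value are all assumptions made for illustration. The `target_feats` argument stands in for per-patch features that a snapshot video encoder would produce from the uncorrupted video; how that snapshot encoder is updated is not covered here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def masked_feature_loss(student_feats, target_feats, mask):
    """Mean squared distance between normalized features, masked patches only.

    student_feats: per-patch predictions from the video encoder (corrupted input).
    target_feats:  per-patch targets (stand-in for snapshot encoder outputs).
    mask:          True where the patch was masked in the input.
    The squared-distance form is an assumption, not taken from the paper.
    """
    losses = []
    for s, t, m in zip(student_feats, target_feats, mask):
        if not m:
            continue  # visible patches do not contribute to the loss
        s, t = l2_normalize(s), l2_normalize(t)
        losses.append(sum((a - b) ** 2 for a, b in zip(s, t)))
    return sum(losses) / len(losses) if losses else 0.0


def contrastive_loss(video_embs, text_embs, temperature=0.05):
    """Symmetric InfoNCE over paired global embeddings (dual-encoder objective).

    Row i of video_embs is assumed to be paired with row i of text_embs.
    The temperature value is a hypothetical default.
    """
    v = [l2_normalize(x) for x in video_embs]
    t = [l2_normalize(x) for x in text_embs]
    n = len(v)
    sim = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def mean_nll(rows):
        # Cross-entropy with the diagonal entry as the correct class.
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)
            log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_z - row[i]
        return total / n

    cols = [list(c) for c in zip(*sim)]
    return 0.5 * (mean_nll(sim) + mean_nll(cols))


# Toy example: three patches, the last two masked.
student = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target = [[1.0, 0.0], [1.0, 0.0], [2.0, 2.0]]
mask = [False, True, True]
local_loss = masked_feature_loss(student, target, mask)

# Toy example: two video-text pairs, correctly paired versus swapped.
embs = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(embs, embs)
swapped = contrastive_loss(embs, embs[::-1])
```

In this sketch the contrastive term acts only on global embeddings, while the masked-feature term supervises individual patches, which is the gap in local semantics the abstract points to. In practice the two terms would be combined into one training loss; how they are weighted is not specified in the text above.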

Code & Models

Repositories

tencentarc/mcq (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Cancer-related molecular mechanisms research

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Residual Connection · Layer Normalization · Dense Connections · Attention Dropout · Softmax · WordPiece