Bridging Video-text Retrieval with Multiple Choice Questions

Yuying Ge; Yixiao Ge; Xihui Liu; Dian Li; Ying Shan; Xiaohu Qie; Ping Luo

arXiv:2201.04850 · cs.CV · March 18, 2022 · 1 citation

TL;DR

This paper introduces a novel pretext task called Multiple Choice Questions (MCQ) with the BridgeFormer module to enhance fine-grained video-text retrieval, achieving state-of-the-art results efficiently across multiple datasets.

Contribution

The paper proposes a new training framework using MCQ and BridgeFormer to enable detailed video-text interactions while maintaining retrieval efficiency, outperforming existing methods.

Findings

01. Outperforms prior state-of-the-art methods on five datasets in both zero-shot and fine-tuning settings.

02. Captures regional content and temporal dynamics in videos.

03. Achieves competitive results on single-modality downstream tasks while using less pre-training data.

Abstract

Pre-training a model to learn transferable video-text representation for retrieval has attracted a lot of attention in recent years. Previous dominant works mainly adopt two separate encoders for efficient retrieval, but ignore local associations between videos and texts. Another line of research uses a joint encoder to interact video with texts, but results in low efficiency since each text-video pair needs to be fed into the model. In this work, we enable fine-grained video-text interactions while maintaining high efficiency for retrieval via a novel pretext task, dubbed as Multiple Choice Questions (MCQ), where a parametric module BridgeFormer is trained to answer the "questions" constructed by the text features via resorting to the video features. Specifically, we exploit the rich semantics of text (i.e., nouns and verbs) to build questions, with which the video encoder can be…
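
The efficiency argument in the abstract can be illustrated with a small sketch. It is not the authors' code: the toy embeddings, the function names (`rank_videos`, `encoder_calls`), and the use of cosine similarity as the scoring function are all assumptions made for illustration. With two separate encoders, each text and each video is encoded once and retrieval reduces to comparing precomputed vectors, whereas a joint encoder must be run on every text-video pair.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0 for _ in v]
    return [x / norm for x in v]


def cosine_scores(text_emb: Sequence[float],
                  video_embs: Sequence[Sequence[float]]) -> List[float]:
    """Cosine similarity between one text embedding and each video embedding."""
    q = l2_normalize(text_emb)
    return [sum(a * b for a, b in zip(q, l2_normalize(v))) for v in video_embs]


def rank_videos(text_emb: Sequence[float],
                video_embs: Sequence[Sequence[float]]) -> List[int]:
    """Return video indices sorted from most to least similar to the text."""
    scores = cosine_scores(text_emb, video_embs)
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def encoder_calls(num_texts: int, num_videos: int) -> Dict[str, int]:
    """Count encoder forward passes needed to score every text-video pair."""
    return {
        # Separate encoders: each item is encoded once, then compared cheaply.
        "dual_encoder": num_texts + num_videos,
        # Joint encoder: every (text, video) pair goes through the model.
        "joint_encoder": num_texts * num_videos,
    }


# Hypothetical precomputed embeddings; in practice these would come from
# the video encoder and the text encoder respectively.
video_embs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
text_emb = [0.9, 0.1]

ranking = rank_videos(text_emb, video_embs)   # [0, 2, 1]
calls = encoder_calls(1000, 1000)
print(ranking, calls)
```

For 1,000 texts and 1,000 videos, the sketch counts 2,000 encoder passes for the dual-encoder setup against 1,000,000 for the joint encoder. This is the gap the paper aims to keep while still learning fine-grained interactions during pre-training.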

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Cancer-related molecular mechanisms research