MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization

Alexander Kunitsyn; Maksim Kalashnikov; Maksim Dzabraev; Andrei Ivaniuta

arXiv:2203.07086·cs.CV·March 15, 2022·1 cites

Open Access

TL;DR

This paper introduces MDMMT-2, a state-of-the-art multimodal transformer model for text-to-video retrieval that effectively combines multiple data sources and training strategies to improve generalization and performance.

Contribution

The paper proposes a novel three-stage training process and a double positional encoding technique to enhance multimodal fusion and leverage noisy datasets for video retrieval.

Findings

01. Achieves state-of-the-art text-to-video retrieval results on MSR-VTT, LSMDC, MSVD, YouCook2 and TGIF with a single model.

02. Combines three data sources: weakly-supervised videos, crowd-labeled text-image pairs and text-video pairs.

03. Demonstrates improved generalization with the proposed training and encoding methods.

Abstract

In this work we present a new state of the art on the text-to-video retrieval task on MSR-VTT, LSMDC, MSVD, YouCook2 and TGIF, obtained by a single model. Three different data sources are combined: weakly-supervised videos, crowd-labeled text-image pairs and text-video pairs. A careful analysis of available pre-trained networks helps to choose those that provide the best prior knowledge. We introduce a three-stage training procedure that transfers knowledge efficiently and allows noisy datasets to be used during training without degrading prior knowledge. Additionally, double positional encoding is used for better fusion of different modalities, and a simple method for processing non-square inputs is suggested.
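
For readers unfamiliar with the task, text-to-video retrieval is commonly framed as ranking candidate videos by the similarity between a text query embedding and each video embedding. The sketch below illustrates only that ranking step and a recall-at-k score on toy vectors. It is not the MDMMT-2 implementation: the function names (`cosine`, `rank_videos`, `recall_at_k`), the toy embeddings, and the assumption that query i matches video i are all hypothetical and chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_videos(text_emb, video_embs):
    """Return video indices sorted from most to least similar to the query."""
    scores = [cosine(text_emb, v) for v in video_embs]
    return sorted(range(len(video_embs)), key=lambda i: scores[i], reverse=True)


def recall_at_k(text_embs, video_embs, k):
    """Fraction of queries whose matching video appears in the top k.

    Assumption for this sketch: query i is paired with video i.
    """
    hits = sum(
        1 for i, t in enumerate(text_embs) if i in rank_videos(t, video_embs)[:k]
    )
    return hits / len(text_embs)


# Toy 2-D embeddings standing in for real model outputs (hypothetical values).
videos = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.2, 0.8], [-1.0, 0.5]]

print(rank_videos(queries[0], videos))
print(recall_at_k(queries, videos, k=1))
print(recall_at_k(queries, videos, k=2))
```

In a real system the embeddings would come from trained text and video encoders, and the similarity matrix would be computed in batch rather than one pair at a time.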

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization