MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization
Alexander Kunitsyn, Maksim Kalashnikov, Maksim Dzabraev, Andrei Ivaniuta

TL;DR
This paper introduces MDMMT-2, a state-of-the-art multimodal transformer model for text-to-video retrieval that effectively combines multiple data sources and training strategies to improve generalization and performance.
Contribution
The paper proposes a novel three-stage training process and a double positional encoding technique to enhance multimodal fusion and leverage noisy datasets for video retrieval.
Findings
Achieves state-of-the-art text-to-video retrieval results on MSR-VTT, LSMDC, MSVD, YouCook2 and TGIF with a single model.
Combines three data sources: weakly-supervised videos, crowd-labeled text-image pairs, and text-video pairs.
Demonstrates improved generalization with the proposed training and encoding methods.
Abstract
In this work we present a new state of the art on the text-to-video retrieval task on MSR-VTT, LSMDC, MSVD, YouCook2 and TGIF, obtained by a single model. Three different data sources are combined: weakly-supervised videos, crowd-labeled text-image pairs and text-video pairs. A careful analysis of available pre-trained networks helps select those that provide the best prior knowledge. We introduce a three-stage training procedure that transfers knowledge efficiently and allows noisy datasets to be used during training without degrading prior knowledge. Additionally, double positional encoding is used for better fusion of different modalities, and a simple method for processing non-square inputs is suggested.
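To make the text-to-video retrieval task concrete, here is a minimal sketch of how a retrieval system might rank candidate videos for a text query once both have been embedded into a shared space. The abstract does not say how MDMMT-2 scores text-video pairs or which metric it reports, so cosine similarity and Recall@K are assumptions here, and the function names `cosine`, `rank_videos` and `recall_at_k` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors (assumed scoring function)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_videos(text_emb, video_embs):
    """Return video indices ordered from most to least similar to the text query."""
    scores = [cosine(text_emb, v) for v in video_embs]
    return sorted(range(len(video_embs)), key=lambda i: -scores[i])

def recall_at_k(text_embs, video_embs, k):
    """Fraction of queries whose matching video (same index) appears in the top k.

    Assumes text_embs[i] is the caption of video_embs[i].
    """
    hits = sum(
        1 for i, t in enumerate(text_embs)
        if i in rank_videos(t, video_embs)[:k]
    )
    return hits / len(text_embs)

# Toy 2-D embeddings, purely illustrative; real embeddings come from trained encoders.
videos = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.1]
ranking = rank_videos(query, videos)
```

In this sketch the encoders are out of scope: only the ranking step that turns embeddings into a retrieval result is shown.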
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
