Multi-modal Transformer for Video Retrieval

Valentin Gabeur; Chen Sun; Karteek Alahari; Cordelia Schmid

arXiv:2007.10639 · cs.CV · July 22, 2020

TL;DR

This paper introduces a multi-modal transformer that jointly encodes the different modalities in video, letting each one attend to the others while also modeling temporal information. With a jointly optimized language embedding, it reports state-of-the-art caption-to-video retrieval results on three datasets.

Contribution

The paper proposes a multi-modal transformer architecture that jointly encodes the modalities present in video, and it investigates how best to optimize the language embedding together with that transformer for caption-to-video retrieval.

Findings

01. Achieves state-of-the-art results on three video retrieval datasets.

02. Models cross-modal interactions and temporal information within a single transformer.

03. Jointly optimizes the language embedding together with the multi-modal transformer.

Abstract

The task of retrieving video content relevant to natural language queries plays a critical role in effectively handling internet-scale datasets. Most of the existing methods for this caption-to-video retrieval problem do not fully exploit cross-modal cues present in video. Furthermore, they aggregate per-frame visual features with limited or no temporal information. In this paper, we present a multi-modal transformer to jointly encode the different modalities in video, which allows each of them to attend to the others. The transformer architecture is also leveraged to encode and model the temporal information. On the natural language side, we investigate the best practices to jointly optimize the language embedding together with the multi-modal transformer. This novel framework allows us to establish state-of-the-art results for video retrieval on three datasets. More details are…
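The two ideas in the abstract, letting every modality attend to the others and ranking videos against a caption, can be illustrated with a small standard-library sketch. It is not the authors' implementation: it assumes a single attention head with identity projections (no learned weights), swaps in the standard sinusoidal encoding as a stand-in for temporal information, mean-pools the attended tokens, and ranks by cosine similarity. The modality names and all function names are hypothetical.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def temporal_encoding(t, dim):
    # Standard sinusoidal encoding of time index t (an assumption here,
    # used only to make token order matter).
    return [
        math.sin(t / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(t / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def self_attention(tokens):
    # Scaled dot-product self-attention with identity projections:
    # every token attends to every token, regardless of modality.
    dim = len(tokens[0])
    attended = []
    for query in tokens:
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in tokens])
        attended.append([
            sum(w * value[i] for w, value in zip(weights, tokens))
            for i in range(dim)
        ])
    return attended


def encode_video(features):
    # features: list of (modality, time_index, vector) triples.
    # The modality label is carried only for readability in this sketch.
    dim = len(features[0][2])
    tokens = [
        [x + p for x, p in zip(vec, temporal_encoding(t, dim))]
        for _modality, t, vec in features
    ]
    attended = self_attention(tokens)
    # Mean pooling into one video embedding (an assumption of this sketch).
    return [sum(tok[i] for tok in attended) / len(attended) for i in range(dim)]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def rank_videos(caption_embedding, video_embeddings):
    # Return video names sorted from most to least similar to the caption.
    return sorted(
        video_embeddings,
        key=lambda name: cosine(caption_embedding, video_embeddings[name]),
        reverse=True,
    )
```

For example, `encode_video([("appearance", 0, [1.0, 0.0, 0.0, 0.0]), ("audio", 1, [0.0, 1.0, 0.0, 0.0])])` yields one 4-dimensional embedding in which each modality's token has been mixed with the other's. In a trained system the projections, the language embedding, and the similarity would all be learned rather than fixed as they are here.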

Code & Models

Repositories

gabeur/mmt (PyTorch)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization