Multi-modal Transformer for Video Retrieval
Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid

TL;DR
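This paper introduces a multi-modal transformer that jointly encodes the different modalities in a video, letting each attend to the others while also modeling temporal information. The language embedding is optimized together with the transformer, and the framework establishes state-of-the-art caption-to-video retrieval results on three datasets.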
Contribution
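The paper proposes a multi-modal transformer that jointly encodes the different modalities in a video so that each can attend to the others, and investigates how best to optimize the language embedding together with it, improving caption-to-video retrieval accuracy over existing methods.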
Findings
Establishes state-of-the-art results on three video retrieval datasets.
Models cross-modal interactions and temporal information within a single transformer.
Jointly optimizes the language embedding with the multi-modal transformer.
Abstract
The task of retrieving video content relevant to natural language queries plays a critical role in effectively handling internet-scale datasets. Most of the existing methods for this caption-to-video retrieval problem do not fully exploit cross-modal cues present in video. Furthermore, they aggregate per-frame visual features with limited or no temporal information. In this paper, we present a multi-modal transformer to jointly encode the different modalities in video, which allows each of them to attend to the others. The transformer architecture is also leveraged to encode and model the temporal information. On the natural language side, we investigate the best practices to jointly optimize the language embedding together with the multi-modal transformer. This novel framework allows us to establish state-of-the-art results for video retrieval on three datasets. More details are…
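The joint encoding described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes two hypothetical modalities ("appearance" and "audio"), random toy features, additive modality and time embeddings, and a single attention head with identity projections. It only shows how placing tokens from all modalities in one sequence lets each attend to the others across time.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_tokens(features, modality_emb, time_emb):
    """Flatten per-modality feature sequences into one token list.

    features: dict mapping modality name -> list of (time_index, vector).
    Each token is feature + modality embedding + time embedding
    (an assumed scheme for marking where a token comes from).
    """
    tokens = []
    for name, sequence in features.items():
        for t, feat in sequence:
            tokens.append([f + m + p for f, m, p
                           in zip(feat, modality_emb[name], time_emb[t])])
    return tokens

def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (a simplification)."""
    d = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        scores = [dot(query, key) / math.sqrt(d) for key in tokens]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of all tokens, regardless of which modality they came from.
        outputs.append([sum(wi * v[j] for wi, v in zip(w, tokens)) for j in range(d)])
    return outputs, weights

random.seed(0)
DIM, STEPS = 4, 2

def rand_vec():
    return [random.gauss(0.0, 1.0) for _ in range(DIM)]

# Hypothetical toy inputs: two modalities, two time steps each.
features = {
    "appearance": [(t, rand_vec()) for t in range(STEPS)],
    "audio": [(t, rand_vec()) for t in range(STEPS)],
}
modality_emb = {name: rand_vec() for name in features}
time_emb = [rand_vec() for _ in range(STEPS)]

tokens = build_tokens(features, modality_emb, time_emb)
outputs, weights = self_attention(tokens)
```

Each row of `weights` spans all four tokens, so an "audio" token at one time step can place attention mass on an "appearance" token at another. A real model would add learned projections, multiple heads and layers, and a language-side embedding trained against the video representation.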
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
