Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval

Nina Shvetsova; Brian Chen; Andrew Rouditchenko; Samuel Thomas; Brian Kingsbury; Rogerio Feris; David Harwath; James Glass; Hilde Kuehne

arXiv:2112.04446·cs.CV·August 19, 2022

TL;DR

This paper introduces a multi-modal fusion transformer that learns to fuse video, audio, and text into a single joint embedding for zero-shot video retrieval and localization, without using position or modality encodings.

Contribution

It presents a modality-agnostic fusion transformer trained with a combinatorial loss, enabling flexible multi-modal input processing and state-of-the-art zero-shot video retrieval performance.

Findings

01

Achieved state-of-the-art zero-shot retrieval results on four benchmarks.

02

The model can process any number of modalities and variable input lengths.

03

Demonstrated effectiveness on the large-scale HowTo100M dataset.

Abstract

Multi-modal learning from video data has seen increased attention recently, as it allows training semantically meaningful embeddings without human annotation, enabling tasks like zero-shot retrieval and classification. In this work, we present a multi-modal, modality-agnostic fusion transformer approach that learns to exchange information between multiple modalities, such as video, audio, and text, and integrate them into a joint multi-modal representation to obtain an embedding that aggregates multi-modal temporal information. We propose to train the system with a combinatorial loss on everything at once, single modalities as well as pairs of modalities, explicitly leaving out any add-ons such as position or modality encoding. At test time, the resulting model can process and fuse any number of input modalities. Moreover, the implicit properties of the transformer allow to process…
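
To make the "combinatorial loss on everything at once" idea more concrete, the following is a minimal sketch rather than the authors' implementation. It assumes three modalities, assumes that each single modality is contrasted with every other single modality and with the fused pair of the remaining two, and swaps in a symmetric InfoNCE-style objective with a temperature of 0.05 for illustration. The names `loss_terms`, `contrastive`, `combinatorial_loss`, and the layout of the `embeddings` dictionary are all hypothetical.

```python
import math
from itertools import combinations

MODALITIES = ("video", "audio", "text")


def loss_terms(modalities=MODALITIES):
    """List the (left, right) modality groups that get contrasted.

    Assumption: every single modality is contrasted with every other single
    modality, and with the fused pair formed by the remaining two.
    """
    terms = [((a,), (b,)) for a, b in combinations(modalities, 2)]
    for m in modalities:
        rest = tuple(x for x in modalities if x != m)
        terms.append(((m,), rest))
    return terms


def _normalize(v):
    # Scale a vector to unit length (zero vectors are left unchanged).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def contrastive(xs, ys, temperature=0.05):
    """Symmetric InfoNCE-style loss; xs[i] and ys[i] are a matching pair."""
    xs = [_normalize(x) for x in xs]
    ys = [_normalize(y) for y in ys]
    n = len(xs)
    sims = [[sum(a * b for a, b in zip(x, y)) / temperature for y in ys] for x in xs]

    def nll(row, target):
        # Negative log-softmax of the matching entry, computed stably.
        top = max(row)
        return top + math.log(sum(math.exp(s - top) for s in row)) - row[target]

    x_to_y = sum(nll(sims[i], i) for i in range(n))
    y_to_x = sum(nll([sims[j][i] for j in range(n)], i) for i in range(n))
    return (x_to_y + y_to_x) / (2 * n)


def combinatorial_loss(embeddings, temperature=0.05):
    """Sum the contrastive loss over all terms from loss_terms().

    `embeddings` maps a tuple of modality names to a batch of vectors, e.g.
    embeddings[("video",)] or embeddings[("audio", "text")] (hypothetical layout).
    """
    return sum(
        contrastive(embeddings[left], embeddings[right], temperature)
        for left, right in loss_terms()
    )
```

In this sketch the embeddings for fused pairs such as `("audio", "text")` are expected to come from a fusion model that is not shown here; only the enumeration of loss terms and their summation are illustrated.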

Code & Models

Repositories

ninatu/everything_at_once (PyTorch, official)
