VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

Hassan Akbari; Liangzhe Yuan; Rui Qian; Wei-Hong Chuang; Shih-Fu Chang; Yin Cui; Boqing Gong

arXiv:2104.11178 · cs.CV · December 8, 2021 · 340 cites

TL;DR

VATT introduces a convolution-free Transformer framework for multimodal self-supervised learning from raw video, audio, and text signals, achieving state-of-the-art results across various downstream tasks without supervised pre-training.

Contribution

The paper presents VATT, a novel multimodal Transformer architecture trained end-to-end with contrastive losses, outperforming ConvNet-based models on multiple benchmarks.
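
To make "contrastive losses" concrete, here is a minimal sketch of a generic symmetric InfoNCE-style loss between paired video and audio embeddings. It is an illustration under stated assumptions, not the paper's implementation: the function names, the temperature of 0.07, and the toy 2-D embeddings are all hypothetical, and the paper's actual objectives may differ in detail.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, positives, temperature=0.07):
    """Mean InfoNCE loss over a batch.

    anchors[i] is expected to match positives[i]; every other item in
    `positives` serves as a negative. The temperature is an assumed value.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [dot(ai, pj) / temperature for pj in p]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability of picking the true pair.
        total += log_denom - logits[i]
    return total / len(a)


def symmetric_contrastive_loss(video, audio, temperature=0.07):
    """Average the video-to-audio and audio-to-video directions."""
    return 0.5 * (
        info_nce(video, audio, temperature)
        + info_nce(audio, video, temperature)
    )


# Toy embeddings (hypothetical): two clips, two audio tracks.
video = [[1.0, 0.0], [0.0, 1.0]]
audio_aligned = [[0.9, 0.1], [0.1, 0.9]]  # row i matches clip i
audio_swapped = [[0.1, 0.9], [0.9, 0.1]]  # rows deliberately mismatched

loss_aligned = symmetric_contrastive_loss(video, audio_aligned)
loss_swapped = symmetric_contrastive_loss(video, audio_swapped)
```

With correctly paired inputs the loss is close to zero, while swapping the audio rows makes it much larger, which is the signal that pulls matching modalities together during training.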

Findings

1. VATT's vision Transformer achieves 82.1% top-1 accuracy on Kinetics-400.

2. VATT's audio Transformer sets a new record for waveform-based audio event recognition, with 39.4% mAP on AudioSet.

3. VATT transfers to image classification, reaching 78.7% top-1 accuracy on ImageNet.

Abstract

We present a framework for learning multimodal representations from unlabeled data using convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks. We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval. Furthermore, we study a modality-agnostic, single-backbone Transformer by sharing weights among the three modalities. We show that the convolution-free VATT outperforms state-of-the-art ConvNet-based architectures in the downstream tasks. Especially, VATT's vision Transformer achieves the top-1 accuracy of 82.1% on Kinetics-400, 83.6% on…

Code & Models

Videos

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text · slideslive

Taxonomy

Topics: Human Pose and Action Recognition · Music and Audio Processing · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · VATT · Softmax · Label Smoothing · Layer Normalization · Residual Connection · Byte Pair Encoding