VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, Boqing Gong

TL;DR
VATT introduces a convolution-free Transformer framework for multimodal self-supervised learning from raw video, audio, and text signals, achieving state-of-the-art results across various downstream tasks without supervised pre-training.
Contribution
The paper presents VATT, a novel multimodal Transformer architecture trained end-to-end with contrastive losses, outperforming ConvNet-based models on multiple benchmarks.
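To make the phrase "contrastive losses" concrete, here is a minimal sketch of a generic symmetric InfoNCE-style loss between two batches of paired embeddings (for example, video and audio clips from the same source). It is an illustration under stated assumptions, not the paper's exact objective: the function names `l2_normalize` and `info_nce`, the temperature value, and the use of NumPy are all choices made for this example.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE-style loss for paired embeddings a[i] <-> b[i].

    a, b: arrays of shape (batch, dim). Row i of `a` and row i of `b` are
    treated as a positive pair; every other row in the batch is a negative.
    The temperature default is an assumption for this sketch.
    """
    a = l2_normalize(a)
    b = l2_normalize(b)
    logits = a @ b.T / temperature  # (batch, batch) similarity matrix

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row, then pick the
        # diagonal entries, which correspond to the positive pairs.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the a->b and b->a directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(8, 16))
    audio = video + 0.01 * rng.normal(size=(8, 16))  # nearly aligned pairs
    print("aligned:", info_nce(video, audio))
    print("mismatched:", info_nce(video, audio[::-1]))
```

The loss is low when each embedding is most similar to its own partner and high when the pairing is scrambled, which is the property a contrastive objective relies on to align modalities without labels.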
Findings
VATT's vision Transformer achieves top-1 accuracy of 82.1% on Kinetics-400.
VATT's audio Transformer sets a new record with 39.4% mAP on AudioSet.
VATT generalizes well to image classification, reaching 78.7% top-1 accuracy on ImageNet.
Abstract
We present a framework for learning multimodal representations from unlabeled data using convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks. We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval. Furthermore, we study a modality-agnostic, single-backbone Transformer by sharing weights among the three modalities. We show that the convolution-free VATT outperforms state-of-the-art ConvNet-based architectures in the downstream tasks. Especially, VATT's vision Transformer achieves the top-1 accuracy of 82.1% on Kinetics-400, 83.6% on…
Taxonomy
Topics: Human Pose and Action Recognition · Music and Audio Processing · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · VATT · Softmax · Label Smoothing · Layer Normalization · Residual Connection · Byte Pair Encoding
