data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language

Alexei Baevski; Wei-Ning Hsu; Qiantong Xu; Arun Babu; Jiatao Gu; Michael Auli

arXiv:2202.03555 · cs.LG · October 27, 2022 · 240 cites

Open Access · 5 Repos · 10 Models · 1 Video

TL;DR

data2vec is a unified self-supervised learning framework for speech, vision, and language. A standard Transformer predicts contextualized latent representations of the full input from a masked view of it, and the method reports state-of-the-art or competitive results on major benchmarks.

Contribution

The paper presents a general framework that applies the same learning method and architecture to self-supervised learning across modalities.

Findings

01 Achieves state-of-the-art results in speech recognition

02 Sets new benchmarks in image classification

03 Performs competitively in natural language understanding

Abstract

While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind. To get us closer to general self-supervised learning, we present data2vec, a framework that uses the same learning method for either speech, NLP or computer vision. The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup using a standard Transformer architecture. Instead of predicting modality-specific targets such as words, visual tokens or units of human speech which are local in nature, data2vec predicts contextualized latent representations that contain information from the entire input. Experiments on the major benchmarks of speech recognition, image classification, and natural language understanding…
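
The core idea in the abstract (a student predicts latent targets for masked positions, where the targets come from a teacher that sees the full input) can be illustrated with a small sketch. This is a toy under stated assumptions, not the authors' implementation: `toy_encoder` is a hypothetical stand-in for the Transformer with one scalar weight per layer, masking replaces inputs with a fixed `mask_value` rather than a learned embedding, and all function names are made up for illustration. The exponential-moving-average teacher, the averaging of the top K layer outputs, and the smooth L1 loss follow the description in the full paper; target normalization and the regression head are omitted for brevity.

```python
def ema_update(teacher, student, tau):
    """Move each teacher weight toward the student: t <- tau*t + (1 - tau)*s."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]


def toy_encoder(inputs, weights):
    """Hypothetical stand-in for a Transformer (assumption, not the real model).

    Each layer mixes every position with the sequence mean, so every output
    depends on the whole input. Returns the outputs of all layers.
    """
    layers, h = [], list(inputs)
    for w in weights:
        mean = sum(h) / len(h)
        h = [w * x + (1.0 - w) * mean for x in h]
        layers.append(h)
    return layers


def build_targets(layers, top_k):
    """Average the outputs of the last top_k layers at each position."""
    top = layers[-top_k:]
    return [sum(column) / len(top) for column in zip(*top)]


def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss: quadratic for small errors, linear for large ones."""
    d = abs(pred - target)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta


def data2vec_style_loss(inputs, mask, student_w, teacher_w,
                        top_k=2, mask_value=0.0):
    """Loss on masked positions only; mask[i] is True where input i is hidden."""
    # The teacher encodes the full, unmasked input to produce the targets.
    targets = build_targets(toy_encoder(inputs, teacher_w), top_k)
    # The student encodes the masked view and predicts from its last layer.
    masked = [mask_value if m else x for x, m in zip(inputs, mask)]
    preds = toy_encoder(masked, student_w)[-1]
    losses = [smooth_l1(p, t) for p, t, m in zip(preds, targets, mask) if m]
    if not losses:
        raise ValueError("mask must hide at least one position")
    return sum(losses) / len(losses)


if __name__ == "__main__":
    inputs = [1.0, 2.0, 3.0, 4.0]
    mask = [False, True, False, True]
    student_w = [0.5, 0.5, 0.5]
    teacher_w = list(student_w)  # teacher starts as a copy of the student
    print("loss:", data2vec_style_loss(inputs, mask, student_w, teacher_w))
    # After a (pretend) student update, the teacher tracks it slowly.
    teacher_w = ema_update(teacher_w, [0.6, 0.6, 0.6], tau=0.999)
    print("teacher weights:", teacher_w)
```

Because the teacher is only ever updated through `ema_update`, gradients would flow through the student alone in a real training loop; the sketch leaves the optimizer step out.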

Code & Models

Videos

ML4Audio - Data2vec paper discussion · YouTube

Taxonomy

Topics

Domain Adaptation and Few-Shot Learning · Speech Recognition and Synthesis · Multimodal Machine Learning Applications

Methods

Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Residual Connection · Absolute Position Encodings · Softmax · Byte Pair Encoding · Dropout · Label Smoothing