data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli

TL;DR
data2vec is a unified self-supervised learning framework for speech, vision, and language. It predicts contextualized latent representations with a standard Transformer and achieves state-of-the-art or competitive results on major benchmarks.
Contribution
The paper presents a general framework that applies the same self-supervised learning method and architecture to speech, vision, and language.
Findings
Achieves state-of-the-art results in speech recognition
Reports state-of-the-art results in image classification
Performs competitively in natural language understanding
Abstract
While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind. To get us closer to general self-supervised learning, we present data2vec, a framework that uses the same learning method for either speech, NLP or computer vision. The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup using a standard Transformer architecture. Instead of predicting modality-specific targets such as words, visual tokens or units of human speech which are local in nature, data2vec predicts contextualized latent representations that contain information from the entire input. Experiments on the major benchmarks of speech recognition, image classification, and natural language understanding…
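The core idea in the abstract (a student predicting the teacher's latent representations of the full input from a masked view) can be illustrated with a toy sketch. This is not the authors' implementation: the `encode` function is a hypothetical stand-in for a Transformer, and the exponential-moving-average teacher, the smooth L1 loss, and every name and constant below are assumptions of this sketch rather than details stated in the summary above. The gradient step on the student is omitted.

```python
import random


def encode(params, xs):
    """Toy stand-in for a Transformer encoder (hypothetical).

    Each output mixes the local input with the mean of the whole
    sequence, so every output carries information from the full input.
    """
    w_local, w_context = params
    context = sum(xs) / len(xs)
    return [w_local * x + w_context * context for x in xs]


def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 regression loss for one scalar pair (assumed loss)."""
    diff = abs(pred - target)
    return 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta


def masked_prediction_loss(student, teacher, xs, masked, mask_value=0.0):
    """Student encodes the masked view; teacher encodes the full input.

    The loss is averaged over the masked positions only.
    """
    masked_view = [mask_value if i in masked else x for i, x in enumerate(xs)]
    preds = encode(student, masked_view)
    targets = encode(teacher, xs)  # treated as constants when training
    losses = [smooth_l1(preds[i], targets[i]) for i in sorted(masked)]
    return sum(losses) / len(losses)


def ema_update(teacher, student, tau=0.999):
    """Move teacher weights toward the student: t <- tau*t + (1-tau)*s."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(8)]  # toy input sequence
    masked = {2, 3, 4}                               # positions hidden from the student
    student = [0.9, 0.2]                             # hypothetical weights
    teacher = [1.0, 0.0]
    loss = masked_prediction_loss(student, teacher, xs, masked)
    teacher = ema_update(teacher, student)
    print(f"loss={loss:.4f}, teacher={teacher}")
```

Because the regression targets come from the teacher's view of the unmasked input, the student is pushed to infer contextual information about the hidden positions rather than to reproduce a modality-specific token, which is the distinction the abstract draws.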
Code & Models
- 🤗 facebook/data2vec-audio-base-100h · model · ♡ 1
- 🤗 facebook/data2vec-audio-base-10m · model · 6 dl · ♡ 1
- 🤗 facebook/data2vec-audio-base-960h · model · 5.7k dl · ♡ 12
- 🤗 facebook/data2vec-audio-base · model · 1.6k dl · ♡ 4
- 🤗 facebook/data2vec-text-base · model · 1.4k dl · ♡ 12
- 🤗 patrickvonplaten/data2vec-base · model · 2 dl
- 🤗 facebook/data2vec-audio-large · model · 226 dl · ♡ 1
- 🤗 facebook/data2vec-audio-large-10m · model · 4 dl
- 🤗 facebook/data2vec-audio-large-100h · model · 4 dl · ♡ 2
- 🤗 facebook/data2vec-audio-large-960h · model · 2.1k dl · ♡ 7
Videos
ML4Audio - data2vec paper discussion · YouTube
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Speech Recognition and Synthesis · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Residual Connection · Absolute Position Encodings · Softmax · Byte Pair Encoding · Dropout · Label Smoothing
