TL;DR
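The paper introduces TERA, a self-supervised learning method for speech that pre-trains Transformer Encoders by reconstructing acoustic frames altered along three axes (time, frequency, and magnitude), improving performance on downstream tasks such as phoneme classification, keyword spotting, speaker recognition, and speech recognition.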
Contribution
The paper proposes an alteration-based pre-training approach for Transformer Encoders in speech that outperforms previous self-supervised models across multiple tasks.
Findings
TERA outperforms previous self-supervised models on downstream speech tasks
Smaller models learn better representations, while larger models are more effective when fine-tuned
Pre-training on more data and diverse features enhances performance
Abstract
We introduce a self-supervised speech pre-training method called TERA, which stands for Transformer Encoder Representations from Alteration. Recent approaches often learn by using a single auxiliary task like contrastive prediction, autoregressive prediction, or masked reconstruction. Unlike previous methods, we use alteration along three orthogonal axes to pre-train Transformer Encoders on a large amount of unlabeled speech. The model learns through the reconstruction of acoustic frames from their altered counterpart, where we use a stochastic policy to alter along various dimensions: time, frequency, and magnitude. TERA can be used for speech representations extraction or fine-tuning with downstream models. We evaluate TERA on several downstream tasks, including phoneme classification, keyword spotting, speaker recognition, and speech recognition. We present a large-scale comparison…
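The alteration-and-reconstruction idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `alter` and `l1_reconstruction_loss`, the span widths, the noise level, the use of zeroing as the masking operation, and the choice of an L1 loss are all assumptions made for illustration. The sketch only shows how a single input can be altered along the time, frequency, and magnitude axes while the clean frames serve as the reconstruction target.

```python
import numpy as np


def alter(frames, rng, time_width=7, freq_width=8, noise_std=0.2):
    """Return an altered copy of `frames` (shape [time, freq]) and a time mask.

    All parameter values are hypothetical defaults chosen for illustration.
    """
    altered = frames.copy()
    n_time, n_freq = frames.shape

    # Time axis: zero out one contiguous span of frames.
    t0 = rng.integers(0, n_time - time_width + 1)
    altered[t0:t0 + time_width, :] = 0.0
    time_mask = np.zeros(n_time, dtype=bool)
    time_mask[t0:t0 + time_width] = True

    # Frequency axis: zero out one contiguous band of channels.
    f0 = rng.integers(0, n_freq - freq_width + 1)
    altered[:, f0:f0 + freq_width] = 0.0

    # Magnitude axis: add Gaussian noise to every value.
    altered += rng.normal(0.0, noise_std, size=frames.shape)
    return altered, time_mask


def l1_reconstruction_loss(predicted, target):
    """Mean absolute error between predicted and clean frames (assumed loss)."""
    return float(np.mean(np.abs(predicted - target)))
```

In a pre-training loop, an encoder would receive `altered` as input and its output would be compared against the original `frames` with the reconstruction loss. A stochastic policy, as the abstract describes, would additionally decide at random which of the three alterations to apply to each input; the sketch applies all three for brevity.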
Taxonomy
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Multi-Head Attention · Attention Is All You Need · Softmax · Label Smoothing · Adam · Dense Connections
