TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech

Andy T. Liu, Shang-Wen Li, and Hung-yi Lee

arXiv:2007.06028·eess.AS·August 5, 2021

TL;DR

The paper introduces TERA, a self-supervised speech pre-training method that trains Transformer Encoders to reconstruct acoustic frames altered along three axes (time, frequency, and magnitude), improving performance on phoneme classification, keyword spotting, speaker recognition, and speech recognition.

Contribution

It proposes a novel alteration-based pre-training approach for Transformer Encoders in speech, outperforming previous self-supervised models across multiple tasks.

Findings

01

TERA outperforms previous self-supervised models on downstream speech tasks

02

Smaller models learn better representations; larger models excel in fine-tuning

03

Pre-training on more data and diverse features enhances performance

Abstract

We introduce a self-supervised speech pre-training method called TERA, which stands for Transformer Encoder Representations from Alteration. Recent approaches often learn by using a single auxiliary task like contrastive prediction, autoregressive prediction, or masked reconstruction. Unlike previous methods, we use alteration along three orthogonal axes to pre-train Transformer Encoders on a large amount of unlabeled speech. The model learns through the reconstruction of acoustic frames from their altered counterparts, where we use a stochastic policy to alter along various dimensions: time, frequency, and magnitude. TERA can be used for speech representation extraction or fine-tuning with downstream models. We evaluate TERA on several downstream tasks, including phoneme classification, keyword spotting, speaker recognition, and speech recognition. We present a large-scale comparison…
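The alteration idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `alter` and `l1_reconstruction_loss`, the choice of zero-masking, the use of an L1 loss, and all default values (`time_ratio`, `span`, `max_freq_width`, `noise_prob`, `noise_std`) are assumptions made for illustration only. The sketch corrupts a feature matrix of shape (frames, channels) along the time, frequency, and magnitude axes, and a model would then be trained to reconstruct the original frames from the altered input.

```python
import numpy as np

def alter(frames, rng, time_ratio=0.15, span=7, max_freq_width=16,
          noise_prob=0.15, noise_std=0.2):
    """Return an altered copy of `frames` (shape [T, F]) and the two masks.

    All defaults are illustrative assumptions, not values from the source.
    """
    T, F = frames.shape
    altered = frames.copy()

    # Time axis: zero out randomly placed contiguous spans of frames.
    time_mask = np.zeros(T, dtype=bool)
    n_spans = max(1, int(T * time_ratio / span))
    starts = rng.integers(0, max(1, T - span + 1), size=n_spans)
    for s in starts:
        time_mask[s:s + span] = True
    altered[time_mask, :] = 0.0

    # Frequency axis: zero out one contiguous block of channels
    # whose width is drawn uniformly from 0..max_freq_width.
    width = min(int(rng.integers(0, max_freq_width + 1)), F)
    f0 = int(rng.integers(0, F - width + 1))
    freq_mask = np.zeros(F, dtype=bool)
    freq_mask[f0:f0 + width] = True
    altered[:, freq_mask] = 0.0

    # Magnitude axis: with probability `noise_prob`, add Gaussian noise
    # to every element of the input.
    if rng.random() < noise_prob:
        altered = altered + rng.normal(0.0, noise_std, size=altered.shape)

    return altered, time_mask, freq_mask

def l1_reconstruction_loss(predicted, target):
    """Mean absolute error between predicted and original frames (assumed loss)."""
    return float(np.mean(np.abs(predicted - target)))

# Usage: alter a random stand-in for 100 frames of 80-channel features.
rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 80))
altered, time_mask, freq_mask = alter(frames, rng)
# An identity "model" that just echoes the altered input gives a baseline loss.
baseline = l1_reconstruction_loss(altered, frames)
```

Because each axis is altered by an independent random draw, a single training example may be corrupted on one, two, or all three axes, which is one plausible reading of the "stochastic policy" the abstract mentions.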

Taxonomy

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Multi-Head Attention · Attention Is All You Need · Softmax · Label Smoothing · Adam · Dense Connections