End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures

Gabriel Synnaeve; Qiantong Xu; Jacob Kahn; Tatiana Likhomanenko; Edouard Grave; Vineel Pratap; Anuroop Sriram; Vitaliy Liptchinsky; Ronan Collobert

arXiv:1911.08460·cs.CL·July 16, 2020·166 cites

TL;DR

This paper explores semi-supervised learning for end-to-end speech recognition models using pseudo-labeling, demonstrating improved performance across architectures and establishing new state-of-the-art results on LibriSpeech.

Contribution

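It applies pseudo-labeling to the semi-supervised training of several architectures (ResNet, Time-Depth Separable ConvNets, and Transformers) with either CTC or Seq2Seq losses, reports new state-of-the-art results on LibriSpeech, and analyzes how the amount of unlabeled audio affects the models' reliance on external language models.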

Findings

01

Semi-supervised training improves all architectures and loss functions.

02

Transformer-based acoustic models perform best with the supervised dataset alone; semi-supervision bridges much of the performance gap between architectures.

03

More unlabeled data reduces models' dependence on external language models.

Abstract

We study pseudo-labeling for the semi-supervised training of ResNet, Time-Depth Separable ConvNets, and Transformers for speech recognition, with either CTC or Seq2Seq loss functions. We perform experiments on the standard LibriSpeech dataset, and leverage additional unlabeled data from LibriVox through pseudo-labeling. We show that while Transformer-based acoustic models have superior performance with the supervised dataset alone, semi-supervision improves all models across architectures and loss functions and bridges much of the performance gaps between them. In doing so, we reach a new state-of-the-art for end-to-end acoustic models decoded with an external language model in the standard supervised learning setting, and a new absolute state-of-the-art with semi-supervised training. Finally, we study the effect of leveraging different amounts of unlabeled audio, propose several ways…
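The pseudo-labeling idea in the abstract can be sketched generically: a model trained on the labeled data transcribes the unlabeled audio, and the resulting transcripts are treated as training targets alongside the real ones. The snippet below is a minimal illustration under stated assumptions, not the authors' pipeline. The names `pseudo_label`, `build_training_set`, and `toy_transcribe`, as well as the file names, are hypothetical; in practice `transcribe` would stand in for a trained acoustic model, possibly decoded with a language model.

```python
from typing import Callable, List, Sequence, Tuple

# An example is a pair of (audio identifier, transcript).
Example = Tuple[str, str]


def pseudo_label(
    transcribe: Callable[[str], str],
    unlabeled_audio: Sequence[str],
) -> List[Example]:
    """Transcribe each unlabeled utterance and keep the output as its label."""
    return [(audio, transcribe(audio)) for audio in unlabeled_audio]


def build_training_set(
    labeled: Sequence[Example],
    unlabeled_audio: Sequence[str],
    transcribe: Callable[[str], str],
) -> List[Example]:
    """Combine ground-truth examples with pseudo-labeled ones."""
    return list(labeled) + pseudo_label(transcribe, unlabeled_audio)


def toy_transcribe(audio: str) -> str:
    """Hypothetical stand-in for a model trained on the labeled set."""
    return f"hypothesis for {audio}"


# Hypothetical file names, for illustration only.
labeled = [("libri_0001.flac", "hello world")]
unlabeled = ["vox_0001.flac", "vox_0002.flac"]

combined = build_training_set(labeled, unlabeled, toy_transcribe)
for audio, text in combined:
    print(audio, "->", text)
```

A new model would then be trained on `combined`. How the pseudo-labels are generated and whether they are filtered are design choices this sketch leaves open.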

Code & Models

Repositories

facebookresearch/wav2letter/tree/master/recipes/models/sota/2019

Taxonomy

Topics: Speech Recognition and Synthesis · Machine Learning and Algorithms · Anomaly Detection Techniques and Applications

Methods: Sigmoid Activation · Tanh Activation · Average Pooling · Global Average Pooling · 1x1 Convolution · Batch Normalization · Bottleneck Residual Block · Max Pooling · Kaiming Initialization