Deep Contextualized Acoustic Representations For Semi-Supervised Speech Recognition

Shaoshi Ling; Yuzong Liu; Julian Salazar; Katrin Kirchhoff

arXiv:1912.01679·eess.AS·May 15, 2020

TL;DR

This paper introduces DeCoAR, a deep contextualized acoustic representation learned from unlabeled audio, which significantly improves semi-supervised speech recognition performance and reduces labeled data requirements.

Contribution

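The paper proposes a semi-supervised ASR approach in which representations learned from unlabeled audio replace conventional filterbank features as input to a CTC-based end-to-end recognizer. Systems trained this way outperform filterbank baselines and need less labeled data.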

Findings

01

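Systems trained on DeCoAR outperform those trained on filterbank features, with 42% and 19% relative improvement over the baseline on WSJ eval92 and LibriSpeech test-clean, respectively.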

02

Unsupervised pre-training with DeCoAR reduces labeled data requirements.

03

After unsupervised training on LibriSpeech, supervised training on 100 hours of labeled data performs on par with training on the full 960 hours.

Abstract

We propose a novel approach to semi-supervised automatic speech recognition (ASR). We first exploit a large amount of unlabeled audio data via representation learning, where we reconstruct a temporal slice of filterbank features from past and future context frames. The resulting deep contextualized acoustic representations (DeCoAR) are then used to train a CTC-based end-to-end ASR system using a smaller amount of labeled audio data. In our experiments, we show that systems trained on DeCoAR consistently outperform ones trained on conventional filterbank features, giving 42% and 19% relative improvement over the baseline on WSJ eval92 and LibriSpeech test-clean, respectively. Our approach can drastically reduce the amount of labeled data required; unsupervised training on LibriSpeech then supervision with 100 hours of labeled data achieves performance on par with training on all 960…
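
The reconstruction objective described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the inclusive slice convention (frames t through t+k), the L1 loss, and the helper names `split_context`, `interpolate_slice`, and `l1_reconstruction_loss` are all assumptions made for illustration. The linear interpolation is only a stand-in for the learned network that would predict the slice from past and future context.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]  # one filterbank feature vector


def split_context(
    frames: Sequence[Frame], t: int, k: int
) -> Tuple[List[Frame], List[Frame], List[Frame]]:
    """Split an utterance into past context, target slice, and future context.

    Assumption of this sketch: the slice covers frames t..t+k inclusive.
    """
    if t < 0 or k < 0 or t + k >= len(frames):
        raise ValueError("slice must lie inside the utterance")
    past = list(frames[:t])
    target = list(frames[t : t + k + 1])
    future = list(frames[t + k + 1 :])
    return past, target, future


def interpolate_slice(
    past: Sequence[Frame], future: Sequence[Frame], n: int
) -> List[List[float]]:
    """Hypothetical stand-in predictor: interpolate linearly between the last
    past frame and the first future frame. A real system would use a trained
    network here."""
    if not past or not future:
        raise ValueError("need at least one past and one future frame")
    left, right = past[-1], future[0]
    predicted = []
    for i in range(1, n + 1):
        w = i / (n + 1)
        predicted.append([(1 - w) * a + w * b for a, b in zip(left, right)])
    return predicted


def l1_reconstruction_loss(
    predicted: Sequence[Frame], target: Sequence[Frame]
) -> float:
    """Sum of absolute differences over the slice (L1 is assumed here)."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target slices differ in length")
    total = 0.0
    for p, x in zip(predicted, target):
        if len(p) != len(x):
            raise ValueError("frame dimensions differ")
        total += sum(abs(a - b) for a, b in zip(p, x))
    return total


# Toy utterance: 10 frames of 2-dimensional "filterbank" features on a ramp.
frames = [[float(i), 2.0 * i] for i in range(10)]
past, target, future = split_context(frames, t=3, k=2)
predicted = interpolate_slice(past, future, n=len(target))
loss = l1_reconstruction_loss(predicted, target)
```

Because the toy features lie on a straight line, the interpolation stand-in reconstructs the slice exactly. On real speech the slice is not recoverable this simply, which is why a learned model over the surrounding context is needed.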

Code & Models

Repositories

awslabs/speech-representations (mxnet)
