DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization

Shaoshi Ling; Yuzong Liu

arXiv:2012.06659·eess.AS·December 15, 2020·58 cites

TL;DR

DeCoAR 2.0 is a speech representation learning method that combines a Transformer encoder with vector quantization. Used as input features without fine-tuning, its representations improve ASR performance, especially in low-resource scenarios.

Contribution

The paper proposes a deep contextualized acoustic representation that adds a vector quantization layer between the encoder and reconstruction modules, trained with an objective that combines a reconstruction loss with a vector quantization diversity loss.

Findings

01 Outperforms other speech representations in data-sparse scenarios

02 Lightweight ASR with DeCoAR 2.0 features surpasses full-data models

03 Consistent improvements across various experiments

Abstract

Recent success in speech representation learning enables a new way to leverage unlabeled data to train speech recognition models. In speech representation learning, a large amount of unlabeled data is used in a self-supervised manner to learn a feature representation. A smaller amount of labeled data is then used to train a downstream ASR system on the new feature representations. Building on our previous work DeCoAR and drawing inspiration from other speech representation learning methods, we propose DeCoAR 2.0, a Deep Contextualized Acoustic Representation with vector quantization. We introduce several modifications over DeCoAR: first, we use Transformers in the encoding module instead of LSTMs; second, we introduce a vector quantization layer between the encoder and reconstruction modules; third, we propose an objective that combines the reconstructive loss with a vector quantization diversity loss to…
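
As a rough illustration of the third modification, the sketch below combines a reconstruction loss with a diversity term over codebook usage. It is not the authors' implementation: the L1 reconstruction loss, the normalized-entropy form of the diversity term, the hard argmax code selection, and the `weight` value are all assumptions made for this example, and every function name here is hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def quantize(logits, codebook):
    """Select the codebook entry with the highest logit (hard argmax).

    Assumption: a real system would need a differentiable selection
    during training; this sketch only shows the lookup itself.
    """
    idx = max(range(len(logits)), key=lambda i: logits[i])
    return idx, codebook[idx]


def diversity_loss(batch_logits):
    """1 minus the normalized entropy of the batch-averaged code distribution.

    Returns about 0 when codes are used uniformly and about 1 when
    every frame selects the same code.
    """
    num_codes = len(batch_logits[0])
    probs = [softmax(frame) for frame in batch_logits]
    avg = [sum(p[k] for p in probs) / len(probs) for k in range(num_codes)]
    entropy = -sum(p * math.log(p) for p in avg if p > 0)
    return 1.0 - entropy / math.log(num_codes)


def l1_reconstruction_loss(pred, target):
    """Mean absolute error between predicted and target frame features."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def combined_loss(pred, target, batch_logits, weight=0.1):
    """Reconstruction loss plus a weighted diversity term.

    `weight` is a hypothetical hyperparameter chosen for illustration.
    """
    return l1_reconstruction_loss(pred, target) + weight * diversity_loss(batch_logits)
```

The diversity term penalizes codebook collapse: if all frames pick the same code, the averaged distribution has near-zero entropy and the penalty approaches 1, while uniform usage drives it toward 0.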

Code & Models

Repositories

awslabs/speech-representations
mxnet

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing