DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization
Shaoshi Ling, Yuzong Liu

TL;DR
DeCoAR 2.0 introduces a novel speech representation learning method using Transformers and vector quantization, significantly improving ASR performance especially in low-resource scenarios without fine-tuning.
Contribution
It proposes a new deep contextualized acoustic representation with vector quantization and a combined training objective, advancing speech representation learning.
Findings
Outperforms other representations in data-sparse scenarios
Lightweight ASR with DeCoAR 2.0 features surpasses full-data models
Consistent improvements across various experiments
Abstract
Recent success in speech representation learning enables a new way to leverage unlabeled data to train speech recognition model. In speech representation learning, a large amount of unlabeled data is used in a self-supervised manner to learn a feature representation. Then a smaller amount of labeled data is used to train a downstream ASR system using the new feature representations. Based on our previous work DeCoAR and inspirations from other speech representation learning, we propose DeCoAR 2.0, a Deep Contextualized Acoustic Representation with vector quantization. We introduce several modifications over the DeCoAR: first, we use Transformers in encoding module instead of LSTMs; second, we introduce a vector quantization layer between encoder and reconstruction modules; third, we propose an objective that combines the reconstructive loss with vector quantization diversity loss to…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
