UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data

Chengyi Wang; Yu Wu; Yao Qian; Kenichi Kumatani; Shujie Liu; Furu Wei; Michael Zeng; Xuedong Huang

arXiv:2101.07597 · cs.CL · June 11, 2021 · 20 cites

TL;DR

UniSpeech introduces a unified pre-training method combining supervised and self-supervised learning to enhance speech representations, improving cross-lingual and domain adaptation performance in speech recognition tasks.

Contribution

It presents a novel multi-task learning framework that integrates phonetic CTC and contrastive learning for better speech representation learning.

Findings

01

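Achieves relative phone error rate reductions of up to 13.4% over self-supervised pre-training and up to 17.8% over supervised transfer learning, averaged over all test languages on CommonVoice.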

02

Demonstrates 6% relative word error rate reduction on domain-shift tasks.

03

Outperforms previous self-supervised and transfer learning methods.
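The findings above are stated as relative error rate reductions, meaning the drop in error rate divided by the baseline error rate. A minimal sketch of that arithmetic follows. The function name `relative_reduction` and the numbers used are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Return the relative error-rate reduction as a percentage.

    Both arguments are error rates on the same scale (e.g. percent).
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical numbers for illustration only; not taken from the paper.
baseline_per = 20.0  # phone error rate of a baseline system, in percent
improved_per = 17.0  # phone error rate of an improved system, in percent

reduction = relative_reduction(baseline_per, improved_per)
print(f"{reduction:.1f}% relative reduction")  # prints "15.0% relative reduction"
```

A relative reduction is therefore larger than the absolute difference whenever the baseline error rate is below 100%: here a 3-point absolute drop corresponds to a 15% relative reduction.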

Abstract

In this paper, we propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data, in which supervised phonetic CTC learning and phonetically-aware contrastive self-supervised learning are conducted in a multi-task learning manner. The resultant representations can capture information more correlated with phonetic structures and improve the generalization across languages and domains. We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on public CommonVoice corpus. The results show that UniSpeech outperforms self-supervised pretraining and supervised transfer learning for speech recognition by a maximum of 13.4% and 17.8% relative phone error rate reductions respectively (averaged over all testing languages). The transferability of UniSpeech is also demonstrated on a domain-shift speech…
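The multi-task setup described in the abstract can be illustrated with a small sketch that combines a supervised CTC loss with a contrastive loss. This is not the authors' implementation. It assumes an InfoNCE-style contrastive term over cosine similarities and a simple weighted sum controlled by `alpha`; the function names, the `temperature` value, and the weighting scheme are all assumptions made for illustration, and the CTC loss is treated as a precomputed scalar.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(
    context: Sequence[float],
    target: Sequence[float],
    distractors: Sequence[Sequence[float]],
    temperature: float = 0.1,  # assumed value, for illustration
) -> float:
    """InfoNCE-style loss: negative log-probability of the true target
    among the target and the distractors, scored by cosine similarity."""
    logits = [cosine(context, target) / temperature]
    logits += [cosine(context, d) / temperature for d in distractors]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[0]


def multitask_loss(ctc_loss: float, contrastive: float, alpha: float = 0.5) -> float:
    """Weighted sum of the two objectives (assumed weighting scheme)."""
    return alpha * ctc_loss + (1.0 - alpha) * contrastive


# Toy vectors, hypothetical and for illustration only.
context = [1.0, 0.0]
target = [0.9, 0.1]
distractors = [[0.0, 1.0], [-1.0, 0.0]]

l_contrastive = contrastive_loss(context, target, distractors)
l_ctc = 2.0  # placeholder scalar standing in for a real CTC loss
print(multitask_loss(l_ctc, l_contrastive, alpha=0.5))
```

In this sketch, `alpha` trades off the supervised and self-supervised terms: `alpha=1.0` reduces to the CTC loss alone and `alpha=0.0` to the contrastive loss alone.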

Videos

UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data · SlidesLive

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing