UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data
Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang

TL;DR
UniSpeech introduces a unified pre-training method combining supervised and self-supervised learning to enhance speech representations, improving cross-lingual and domain adaptation performance in speech recognition tasks.
Contribution
It presents a novel multi-task learning framework that integrates phonetic CTC and contrastive learning for better speech representation learning.
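A multi-task objective of this kind is often written as a weighted sum of the two losses. The form below is only a sketch: the weight $\alpha$ and the symbols $\mathcal{L}_{\text{CTC}}$ and $\mathcal{L}_{\text{contrastive}}$ are notation assumed here for illustration, and the exact formulation should be taken from the paper itself.

```latex
\mathcal{L}_{\text{multi-task}}
  = \alpha \, \mathcal{L}_{\text{CTC}}
  + (1 - \alpha) \, \mathcal{L}_{\text{contrastive}},
\qquad 0 \le \alpha \le 1
```

Under this assumed form, $\alpha = 1$ reduces to purely supervised CTC training and $\alpha = 0$ to purely contrastive self-supervised training.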
Findings
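Achieves up to 13.4% relative phone error rate reduction over self-supervised pre-training and up to 17.8% over supervised transfer learning, averaged over all test languages.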
Demonstrates 6% relative word error rate reduction on domain-shift tasks.
Outperforms previous self-supervised and transfer learning methods.
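The findings above are stated as relative error rate reductions. As a reminder of how such a figure is computed, here is a minimal sketch; the function name and the example error rates are hypothetical and are not results from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the relative error-rate reduction, as a percentage of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return 100.0 * (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical error rates for illustration only (not taken from the paper).
    baseline_per = 20.0  # baseline phone error rate, in percent
    improved_per = 17.0  # phone error rate after the change, in percent
    reduction = relative_reduction(baseline_per, improved_per)
    print(f"{reduction:.1f}% relative reduction")
```

A relative reduction is measured against the baseline error rate, so the same absolute gain counts for more when the baseline is already low.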
Abstract
In this paper, we propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data, in which supervised phonetic CTC learning and phonetically-aware contrastive self-supervised learning are conducted in a multi-task learning manner. The resultant representations can capture information more correlated with phonetic structures and improve the generalization across languages and domains. We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on public CommonVoice corpus. The results show that UniSpeech outperforms self-supervised pretraining and supervised transfer learning for speech recognition by a maximum of 13.4% and 17.8% relative phone error rate reductions respectively (averaged over all testing languages). The transferability of UniSpeech is also demonstrated on a domain-shift speech…
Code & Models
- 🤗 microsoft/unispeech-1350-en-168-es-ft-1h (model) · 5 dl
- 🤗 microsoft/unispeech-1350-en-17h-ky-ft-1h (model) · 3 dl · ♡ 1
- 🤗 microsoft/unispeech-1350-en-353-fr-ft-1h (model) · 2 dl
- 🤗 microsoft/unispeech-1350-en-90-it-ft-1h (model) · 5 dl
- 🤗 microsoft/unispeech-large-1500h-cv (model) · 29k dl · ♡ 1
- 🤗 microsoft/unispeech-large-multi-lingual-1500h-cv (model) · 3 dl · ♡ 1
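The fine-tuned checkpoints listed above can likely be loaded through the Hugging Face `transformers` Auto classes. The sketch below makes several assumptions: that the model ID is the listed name without the trailing "model" label, that the checkpoint ships a processor configuration, and that the input is mono audio at 16 kHz. The `transcribe` helper is a hypothetical name introduced here, and the checkpoint's model card is the authority on what kind of output (phonemes or text) it produces.

```python
# Assumed model ID, read from the list above with the trailing "model" label dropped.
MODEL_ID = "microsoft/unispeech-1350-en-168-es-ft-1h"
SAMPLE_RATE = 16_000  # assumed input sampling rate in Hz


def transcribe(waveform, model_id: str = MODEL_ID, sampling_rate: int = SAMPLE_RATE) -> str:
    """Greedy CTC decoding of a 1-D float waveform (hypothetical helper)."""
    # Imported lazily so the module loads even where these packages are absent.
    import torch
    from transformers import AutoModelForCTC, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForCTC.from_pretrained(model_id)
    model.eval()

    # Convert the raw waveform into model input tensors.
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits

    # Take the most likely token per frame; batch_decode collapses CTC repeats and blanks.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

Calling `transcribe(audio_array)` downloads the checkpoint on first use, so it needs network access and the `torch` and `transformers` packages installed.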
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
