Almost Unsupervised Text to Speech and Automatic Speech Recognition
Yi Ren, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu

TL;DR
This paper introduces an almost unsupervised method for TTS and ASR that uses minimal paired data and leverages dual task relationships with a Transformer-based model, achieving high performance on low-resource datasets.
Contribution
It proposes a novel almost unsupervised learning framework leveraging dual tasks, denoising auto-encoders, and bidirectional sequence modeling for TTS and ASR with limited paired data.
Findings
Achieves 99.84% word intelligibility rate on LJSpeech.
Attains 2.68 MOS for TTS and 11.7% PER for ASR.
Uses only 200 paired samples (~20 minutes audio) plus unpaired data.
Abstract
Text to speech (TTS) and automatic speech recognition (ASR) are two dual tasks in speech processing, and both achieve impressive performance thanks to recent advances in deep learning and large amounts of aligned speech and text data. However, the lack of aligned data poses a major practical problem for TTS and ASR on low-resource languages. In this paper, by leveraging the dual nature of the two tasks, we propose an almost unsupervised learning method that leverages only a few hundred paired samples plus extra unpaired data for TTS and ASR. Our method consists of the following components: (1) a denoising auto-encoder, which reconstructs speech and text sequences respectively to develop the capability of language modeling in both the speech and text domains; (2) dual transformation, where the TTS model transforms the text into speech, and the ASR model leverages the transformed…
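As a rough illustration of component (1), the sketch below builds (corrupted input, clean target) pairs of the kind a denoising auto-encoder is trained to reconstruct. The abstract does not say how inputs are corrupted, so the random masking, the `<mask>` token, and the names `corrupt` and `reconstruction_pairs` are assumptions made for this example, not the authors' implementation.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token, not taken from the paper


def corrupt(tokens, mask_prob=0.3, rng=None):
    """Replace each token with MASK independently with probability mask_prob."""
    rng = rng or random.Random()
    return [MASK if rng.random() < mask_prob else t for t in tokens]


def reconstruction_pairs(sequences, mask_prob=0.3, seed=0):
    """Build (corrupted, clean) pairs from unpaired sequences.

    A model trained on these pairs learns to recover the clean sequence
    from its corrupted version, which needs no aligned speech-text data.
    """
    rng = random.Random(seed)
    return [(corrupt(seq, mask_prob, rng), list(seq)) for seq in sequences]


if __name__ == "__main__":
    unpaired_text = [list("hello"), list("speech")]
    for noisy, clean in reconstruction_pairs(unpaired_text):
        print(noisy, "->", clean)
```

The same pairing idea would apply to speech by corrupting frames of an acoustic feature sequence instead of characters; that extension is likewise an assumption here.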
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax
