Almost Unsupervised Text to Speech and Automatic Speech Recognition

Yi Ren; Xu Tan; Tao Qin; Sheng Zhao; Zhou Zhao; Tie-Yan Liu

arXiv:1905.06791 · eess.AS · July 28, 2020 · 40 cites

TL;DR

This paper introduces an almost unsupervised method for TTS and ASR that uses minimal paired data and leverages dual task relationships with a Transformer-based model, achieving high performance on low-resource datasets.

Contribution

The paper proposes an almost unsupervised learning framework for TTS and ASR that combines the dual relationship between the two tasks, denoising auto-encoders, and bidirectional sequence modeling to train with limited paired data.

Findings

01

Achieves 99.84% word intelligibility rate on LJSpeech.

02

Attains a mean opinion score (MOS) of 2.68 for TTS and a phoneme error rate (PER) of 11.7% for ASR.

03

Uses only 200 paired samples (~20 minutes of audio) plus unpaired data.
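
For readers unfamiliar with the ASR metric above, PER (phoneme error rate) is conventionally the edit distance between the reference and predicted phoneme sequences, divided by the reference length. The following is a minimal sketch of that conventional definition, not the authors' evaluation code; the function names and the example phoneme sequences are illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix to each hyp prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """PER = edit distance / number of reference phonemes (hypothetical helper)."""
    if not ref:
        raise ValueError("reference sequence must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Illustrative example: one phoneme ("L") is dropped from a four-phoneme reference.
reference = ["HH", "AH", "L", "OW"]
hypothesis = ["HH", "AH", "OW"]
print(phoneme_error_rate(reference, hypothesis))
```

Under this definition, an 11.7% PER corresponds to roughly one phoneme-level edit for every eight or nine reference phonemes.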

Abstract

Text to speech (TTS) and automatic speech recognition (ASR) are two dual tasks in speech processing, and both achieve impressive performance thanks to recent advances in deep learning and large amounts of aligned speech and text data. However, the lack of aligned data poses a major practical problem for TTS and ASR on low-resource languages. In this paper, by leveraging the dual nature of the two tasks, we propose an almost unsupervised learning method that leverages only a few hundred paired samples and extra unpaired data for TTS and ASR. Our method consists of the following components: (1) a denoising auto-encoder, which reconstructs speech and text sequences respectively to develop the capability of language modeling in both the speech and text domains; (2) dual transformation, where the TTS model transforms the text $y$ into speech $\hat{x}$, and the ASR model leverages the transformed…
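
As a rough illustration of component (1), a denoising auto-encoder is trained to reconstruct a clean sequence from a corrupted copy of it. The sketch below shows only the corruption step on a token sequence. The abstract excerpt does not specify the corruption scheme, so random masking, the `MASK` symbol, the `mask_prob` parameter, and the `corrupt` function are all assumptions made for illustration.

```python
import random

MASK = "<mask>"  # hypothetical placeholder symbol, not taken from the paper


def corrupt(tokens, mask_prob=0.3, seed=None):
    """Return a copy of `tokens` with each element independently replaced by MASK
    with probability `mask_prob` (an assumed corruption scheme)."""
    rng = random.Random(seed)
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


# A denoising auto-encoder would be trained on (corrupted, clean) pairs like this one,
# learning to map the corrupted input back to the clean sequence.
clean = ["t", "e", "x", "t", " ", "t", "o", " ", "s", "p", "e", "e", "c", "h"]
noisy = corrupt(clean, mask_prob=0.3, seed=0)
training_pair = (noisy, clean)
```

The same idea applies to speech by corrupting frames of an acoustic feature sequence instead of text tokens; only unpaired data is needed because the target is the input itself.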

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax