End-to-End Feedback Loss in Speech Chain Framework via Straight-Through Estimator

Andros Tjandra; Sakriani Sakti; Satoshi Nakamura

arXiv:1810.13107 · cs.CL · November 1, 2018 · 5 cites

Open Access

TL;DR

This paper introduces an end-to-end training method for speech chain models using a straight-through estimator, enabling joint optimization of ASR and TTS modules with improved speech recognition accuracy.

Contribution

The paper proposes back-propagating through the discrete ASR outputs with a straight-through (ST) Gumbel-Softmax estimator, which makes end-to-end training of the speech chain possible.
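To make the idea concrete, here is a minimal sketch of a straight-through Gumbel-Softmax step for a single categorical output, written with the Python standard library only. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature `tau`, the example logits, and the upstream gradient values are all hypothetical. The forward pass emits a discrete one-hot token, while the backward pass treats that token as if it were the soft Gumbel-Softmax probabilities and applies the softmax Jacobian.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Soft sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    shift = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot_argmax(probs):
    """Forward pass of the straight-through step: a discrete one-hot token."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(probs))]


def straight_through_backward(probs, grad_out, tau):
    """Backward pass: pretend d(one_hot)/d(probs) is the identity, then
    apply the softmax Jacobian to get the gradient w.r.t. the logits."""
    dot = sum(g * p for g, p in zip(grad_out, probs))
    return [p * (g - dot) / tau for p, g in zip(probs, grad_out)]


# Hypothetical values, chosen only for illustration.
rng = random.Random(0)
logits = [2.0, 0.5, -1.0]          # scores over a 3-symbol vocabulary
tau = 1.0                          # temperature (assumed)
probs = gumbel_softmax(logits, tau, rng)
token = one_hot_argmax(probs)      # what a downstream module would consume
grad_out = [0.3, -0.2, 0.1]        # made-up gradient arriving from downstream
grad_logits = straight_through_backward(probs, grad_out, tau)
print(token, grad_logits)
```

In an automatic-differentiation framework the same effect is usually obtained by computing the one-hot token and adding back the soft probabilities with the difference detached from the graph, so that the forward value is discrete and the gradient follows the soft path.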

Findings

01

11% relative CER reduction in ASR performance

02

Effective end-to-end training of speech chain with discrete outputs

03

Improved reconstruction loss optimization

Abstract

The speech chain mechanism integrates automatic speech recognition (ASR) and text-to-speech synthesis (TTS) modules into a single cycle during training. In our previous work, we applied the speech chain mechanism as a semi-supervised learning method. It allows ASR and TTS to assist each other when they receive unpaired data, letting them infer the missing pair and optimize the models with a reconstruction loss. If we only have speech without transcription, ASR generates the most likely transcription from the speech data, and TTS then uses the generated transcription to reconstruct the original speech features. However, in previous papers, we limited back-propagation to the closest module, which is the TTS part. One reason is that back-propagating the error through the ASR is challenging because the outputs of the ASR are discrete tokens, creating non-differentiability…

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling