End-to-End Feedback Loss in Speech Chain Framework via Straight-Through Estimator
Andros Tjandra, Sakriani Sakti, Satoshi Nakamura

TL;DR
This paper introduces an end-to-end training method for speech chain models using a straight-through estimator, enabling joint optimization of ASR and TTS modules with improved speech recognition accuracy.
Contribution
It proposes a novel approach to back-propagate through discrete ASR outputs using ST-Gumbel-Softmax, enhancing end-to-end speech chain training.
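A minimal sketch of the straight-through Gumbel-Softmax idea, using only the Python standard library. This is not the authors' implementation: the function names (`gumbel_softmax`, `st_forward`, `st_backward`), the three-token vocabulary, and the example gradient are all hypothetical, chosen for illustration. The forward pass emits a hard one-hot token, as a TTS module consuming ASR output would need, while the backward pass treats that one-hot vector as if it were the soft sample and applies the softmax Jacobian.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(logits, tau, rng):
    # Add Gumbel(0, 1) noise g = -log(-log(U)) to each logit,
    # divide by the temperature tau, then apply softmax.
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep the logs finite
        g = -math.log(-math.log(u))
        noisy.append((logit + g) / tau)
    return softmax(noisy)


def st_forward(soft):
    # Forward pass: discretize the soft sample into a one-hot vector.
    k = max(range(len(soft)), key=soft.__getitem__)
    return [1.0 if i == k else 0.0 for i in range(len(soft))]


def st_backward(soft, grad_out, tau):
    # Backward pass: pretend d(hard)/d(soft) is the identity (the
    # straight-through step), then apply the softmax Jacobian:
    # dL/dlogit_j = y_j * (g_j - sum_i g_i * y_i) / tau.
    dot = sum(g * y for g, y in zip(grad_out, soft))
    return [y * (g - dot) / tau for g, y in zip(grad_out, soft)]


# Hypothetical example: logits over a 3-token vocabulary.
rng = random.Random(0)
logits = [2.0, 0.5, -1.0]
soft = gumbel_softmax(logits, tau=1.0, rng=rng)
hard = st_forward(soft)  # what a downstream TTS would consume
# grad_out stands in for the reconstruction-loss gradient w.r.t. the token vector.
grad_logits = st_backward(soft, grad_out=[0.3, -0.1, 0.2], tau=1.0)
```

In an autodiff framework the same effect is usually obtained by computing `hard - stop_gradient(soft) + soft`; the manual backward pass above only spells out what that trick does.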
Findings
11% relative CER reduction in ASR performance
Effective end-to-end training of speech chain with discrete outputs
Improved reconstruction loss optimization
Abstract
The speech chain mechanism integrates automatic speech recognition (ASR) and text-to-speech synthesis (TTS) modules into a single cycle during training. In our previous work, we applied the speech chain mechanism as a semi-supervised learning method. It lets ASR and TTS assist each other when they receive unpaired data: each infers the missing half of the pair, and the models are optimized with a reconstruction loss. If we only have speech without transcription, ASR generates the most likely transcription from the speech data, and then TTS uses the generated transcription to reconstruct the original speech features. However, in previous papers, we limited back-propagation to the closest module, which is the TTS part. One reason is that back-propagating the error through the ASR is challenging because the ASR outputs are discrete tokens, creating non-differentiability…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
