A Non-autoregressive Model for Joint STT and TTS
Vishal Sunder, Brian Kingsbury, George Saon, Samuel Thomas, Slava Shechtman, Hagai Aronowitz, Eric Fosler-Lussier, Luis Lastras

TL;DR
This paper introduces a non-autoregressive, multimodal framework that jointly models speech recognition and synthesis, capable of handling unpaired data and iteratively refining its outputs for improved performance.
Contribution
The paper presents a joint STT and TTS model that is fully non-autoregressive, accepts speech and text as input either individually or together, and refines its own predictions iteratively by feeding partial hypotheses back to its input.
Findings
Outperforms the STT-specific baseline in all tasks
Performs competitively with the TTS-specific baseline across a wide range of evaluation metrics
Can be trained with unpaired speech or text data
Abstract
In this paper, we take a step towards jointly modeling automatic speech recognition (STT) and speech synthesis (TTS) in a fully non-autoregressive way. We develop a novel multimodal framework capable of handling the speech and text modalities as input either individually or together. The proposed model can also be trained with unpaired speech or text data owing to its multimodal nature. We further propose an iterative refinement strategy to improve the STT and TTS performance of our model such that the partial hypothesis at the output can be fed back to the input of our model, thus iteratively improving both STT and TTS predictions. We show that our joint model can effectively perform both STT and TTS tasks, outperforming the STT-specific baseline in all tasks and performing competitively with the TTS-specific baseline across a wide range of evaluation metrics.
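The feedback loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `iterative_refine`, `toy_stt_model`, the type aliases, and the idea of representing speech as a list of integers are all hypothetical names and simplifications chosen for illustration. The sketch only shows the control flow, in which the hypothesis for the missing modality is fed back as input on each pass.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-ins: a real system would use tensors of acoustic
# features and token ids rather than a list of ints and a string.
Speech = List[int]
Text = str
JointModel = Callable[[Optional[Speech], Optional[Text]], Tuple[Speech, Text]]


def iterative_refine(model: JointModel,
                     speech: Optional[Speech] = None,
                     text: Optional[Text] = None,
                     num_iters: int = 3) -> Tuple[Optional[Speech], Optional[Text]]:
    """Feed the model's own hypothesis back as input for num_iters passes."""
    if speech is None and text is None:
        raise ValueError("provide at least one of speech or text")
    speech_hyp, text_hyp = speech, text
    for _ in range(num_iters):
        new_speech, new_text = model(speech_hyp, text_hyp)
        # The observed modality stays fixed; only the missing one is updated.
        if speech is None:
            speech_hyp = new_speech
        if text is None:
            text_hyp = new_text
    return speech_hyp, text_hyp


def toy_stt_model(speech: Optional[Speech],
                  text: Optional[Text]) -> Tuple[Speech, Text]:
    """Toy STT-only model (an assumption): each pass fixes up to two more characters."""
    if speech is None:
        raise NotImplementedError("this toy covers only the STT direction")
    target = "".join(chr(frame) for frame in speech)  # pretend "decoding"
    previous = text or ""
    # Count how many leading characters of the previous hypothesis are right.
    correct = 0
    while (correct < len(previous) and correct < len(target)
           and previous[correct] == target[correct]):
        correct += 1
    keep = min(len(target), correct + 2)
    return speech, target[:keep] + "_" * (len(target) - keep)


# Usage: the text hypothesis improves as more refinement passes are run.
frames = [ord(ch) for ch in "hello"]
for n in (1, 2, 3):
    print(n, iterative_refine(toy_stt_model, speech=frames, num_iters=n)[1])
```

The same loop would cover the TTS direction by passing `text` and leaving `speech` unset, given a model that can produce a speech hypothesis; the toy model above deliberately does not.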
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Emotion and Mood Recognition
