A Non-autoregressive Model for Joint STT and TTS

Vishal Sunder; Brian Kingsbury; George Saon; Samuel Thomas; Slava Shechtman; Hagai Aronowitz; Eric Fosler-Lussier; Luis Lastras

arXiv:2501.09104·cs.SD·January 22, 2025

TL;DR

This paper introduces a non-autoregressive, multimodal framework that jointly models speech recognition (STT) and speech synthesis (TTS), can be trained with unpaired speech or text data, and iteratively refines its outputs by feeding partial hypotheses back to its input.

Contribution

It presents a joint STT and TTS model that is fully non-autoregressive, accepts speech and text as input either individually or together, and supports iterative refinement of both its STT and TTS predictions.

Findings

01 Outperforms the STT-specific baseline in all tasks

02 Performs competitively with the TTS-specific baseline across a wide range of evaluation metrics

03 Can be trained with unpaired speech or text data

Abstract

In this paper, we take a step towards jointly modeling automatic speech recognition (STT) and speech synthesis (TTS) in a fully non-autoregressive way. We develop a novel multimodal framework capable of handling the speech and text modalities as input either individually or together. The proposed model can also be trained with unpaired speech or text data owing to its multimodal nature. We further propose an iterative refinement strategy to improve the STT and TTS performance of our model such that the partial hypothesis at the output can be fed back to the input of our model, thus iteratively improving both STT and TTS predictions. We show that our joint model can effectively perform both STT and TTS tasks, outperforming the STT-specific baseline in all tasks and performing competitively with the TTS-specific baseline across a wide range of evaluation metrics.
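
The feedback loop described in the abstract, where a partial hypothesis at the output is fed back to the input, can be illustrated with a minimal sketch. This is not the authors' implementation: the `JointModel` interface, the function `iterative_refine`, the `num_iters` parameter, and the stand-in `toy_model` are all hypothetical names chosen for illustration, and the toy model merely corrects one more token per pass so that the effect of repeated passes is visible.

```python
from typing import Callable, List, Optional, Tuple

Tokens = List[str]
# Hypothetical interface (an assumption, not from the paper): the model takes
# speech and/or text (None = modality absent) and returns a hypothesis for both.
JointModel = Callable[[Optional[Tokens], Optional[Tokens]], Tuple[Tokens, Tokens]]


def iterative_refine(model: JointModel,
                     speech: Optional[Tokens],
                     text: Optional[Tokens],
                     num_iters: int = 3) -> Tuple[Tokens, Tokens]:
    """Run the model repeatedly, feeding the predicted modality back as input."""
    if speech is None and text is None:
        raise ValueError("at least one modality must be provided")
    if num_iters < 1:
        raise ValueError("num_iters must be at least 1")
    speech_in, text_in = speech, text
    for _ in range(num_iters):
        speech_hyp, text_hyp = model(speech_in, text_in)
        # The observed modality stays fixed; only the predicted one is fed back.
        speech_in = speech if speech is not None else speech_hyp
        text_in = text if text is not None else text_hyp
    return speech_hyp, text_hyp


def toy_model(speech: Optional[Tokens],
              text: Optional[Tokens]) -> Tuple[Tokens, Tokens]:
    """Stand-in for a trained network (STT direction only, speech required).

    The "correct" transcript is the upper-cased speech tokens; each call fixes
    the first position where the incoming text hypothesis is still wrong.
    """
    assert speech is not None, "this toy only demonstrates the STT direction"
    target = [s.upper() for s in speech]
    out = list(text) if text is not None else ["?"] * len(target)
    for i, t in enumerate(target):
        if out[i] != t:
            out[i] = t
            break
    return speech, out


if __name__ == "__main__":
    for k in (1, 2, 3):
        _, hyp = iterative_refine(toy_model, ["a", "b", "c"], None, num_iters=k)
        print(k, hyp)
```

Running the script prints the text hypothesis after one, two, and three passes. In a real system the per-pass improvement would come from the trained model conditioning on its own earlier hypothesis, and the number of passes would be a tunable trade-off between quality and latency.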

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Emotion and Mood Recognition