TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer

Vladimir Bataev; Subhankar Ghosh; Vitaly Lavrukhin; Jason Li

arXiv:2501.06320 · eess.AS · April 16, 2025


TL;DR

TTS-Transducer is an end-to-end text-to-speech system that combines neural transducers and audio codecs to improve speech synthesis quality and robustness without explicit duration modeling.

Contribution

It introduces a novel architecture that integrates neural transducers with audio codec models for end-to-end speech synthesis, avoiding explicit duration predictors.

Findings

01. Competitive speech quality compared to existing TTS systems

02. Robustness to variations in input text and speech

03. End-to-end training simplifies the pipeline

Abstract

This work introduces TTS-Transducer, a novel architecture for text-to-speech that leverages the strengths of audio codec models and neural transducers. Transducers, renowned for their superior quality and robustness in speech recognition, are employed to learn monotonic alignments, which removes the need for explicit duration predictors. Neural audio codecs efficiently compress audio into discrete codes, making it possible to apply text modeling approaches to speech generation. However, audio codec models with residual quantizers require predicting multiple tokens per frame from several codebooks, which poses a significant challenge. The proposed system first uses a transducer architecture to learn monotonic alignments between tokenized text and speech codec tokens for the first codebook. Next, a non-autoregressive Transformer predicts the remaining…
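To make the multi-codebook difficulty concrete, the following is a minimal sketch of residual quantization in plain Python. It is not the codec used in the paper: the codebook count, codebook size, frame dimension, and all values are hypothetical, and the function names `nearest`, `rvq_encode`, and `rvq_decode` are illustrative only. It shows why each audio frame yields one token per codebook, and how the tokens split into a first-codebook stream and a set of remaining-codebook tokens, mirroring the two stages the abstract describes.

```python
import random


def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, entry in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(entry, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def rvq_encode(frame, codebooks):
    """Quantize one frame; each stage encodes the residual left by the previous stage."""
    residual = list(frame)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct a frame by summing the selected entry from every codebook."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical toy setup: 3 codebooks of 4 entries each, 2-dimensional frames.
rng = random.Random(0)
codebooks = [
    [[rng.uniform(-scale, scale) for _ in range(2)] for _ in range(4)]
    for scale in (1.0, 0.5, 0.25)
]
frames = [[0.3, -0.7], [0.9, 0.1], [-0.4, 0.6]]

# One token per codebook for every frame.
codes_per_frame = [rvq_encode(frame, codebooks) for frame in frames]

# Split matching the abstract: the first codebook forms one token sequence,
# and the remaining codebooks are handled separately.
first_codebook_tokens = [codes[0] for codes in codes_per_frame]
remaining_tokens = [codes[1:] for codes in codes_per_frame]
```

In this toy setting, `first_codebook_tokens` is a single sequence with one token per frame, which is the kind of sequence a transducer can align monotonically with text, while `remaining_tokens` holds the additional per-frame tokens that must be predicted given the first.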


Taxonomy

Methods: Absolute Position Encodings · Adam · Residual Connection · Dropout · Softmax · Byte Pair Encoding · Linear Layer · Attention Is All You Need · Multi-Head Attention · Position-Wise Feed-Forward Layer