TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer
Vladimir Bataev, Subhankar Ghosh, Vitaly Lavrukhin, Jason Li

TL;DR
TTS-Transducer is an end-to-end text-to-speech system that combines neural transducers and audio codecs to improve speech synthesis quality and robustness without explicit duration modeling.
Contribution
It introduces a novel architecture that integrates neural transducers with audio codec models for end-to-end speech synthesis, avoiding explicit duration predictors.
Findings
Competitive speech quality compared to existing TTS systems
Robustness to variations in input text and speech
End-to-end training simplifies the pipeline
Abstract
This work introduces TTS-Transducer, a novel architecture for text-to-speech that leverages the strengths of audio codec models and neural transducers. Transducers, renowned for their superior quality and robustness in speech recognition, are employed to learn monotonic alignments, which removes the need for explicit duration predictors. Neural audio codecs efficiently compress audio into discrete codes, revealing the possibility of applying text modeling approaches to speech generation. However, the complexity of predicting multiple tokens per frame from several codebooks, as necessitated by audio codec models with residual quantizers, poses a significant challenge. The proposed system first uses a transducer architecture to learn monotonic alignments between tokenized text and speech codec tokens for the first codebook. Next, a non-autoregressive Transformer predicts the remaining…
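To make the "multiple tokens per frame from several codebooks" point concrete, here is a minimal sketch of residual vector quantization in plain Python. It is an illustration only, not the codec or model used in the paper: the toy codebooks, the frame values, and the helper names (`nearest`, `rvq_encode`, `rvq_decode`) are all assumptions introduced here. Each frame yields one token per codebook; the split at the end mirrors the abstract's description, where the first-codebook tokens are what the transducer aligns to text and the remaining tokens are left to a second model.

```python
def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, entry in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(entry, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def rvq_encode(frame, codebooks):
    """Quantize one frame with residual VQ, producing one token per codebook."""
    residual = list(frame)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # The next codebook quantizes whatever this one failed to capture.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Reconstruct a frame by summing the selected entry from each codebook."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy setup (hypothetical values): 2 codebooks, 2 entries each, 2-dim frames.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # codebook 0: coarse
    [[0.0, 0.0], [0.5, 0.0]],   # codebook 1: refines the residual
]
frames = [[1.4, 1.0], [0.1, 0.0]]

codes = [rvq_encode(f, codebooks) for f in frames]

# Split per the abstract: the first codebook is handled separately
# from the remaining ones.
first_codebook_tokens = [c[0] for c in codes]
remaining_tokens = [c[1:] for c in codes]
```

With N codebooks, a sequence of T frames becomes a T-by-N grid of tokens, which is why predicting all of them autoregressively is costly and why handling the first codebook separately from the rest is attractive.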
Taxonomy
Methods: Absolute Position Encodings · Adam · Residual Connection · Dropout · Softmax · Byte Pair Encoding · Linear Layer · Attention Is All You Need · Multi-Head Attention · Position-Wise Feed-Forward Layer
