TL;DR
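T5Gemma-TTS is an encoder-decoder neural codec TTS model that keeps text conditioning persistent through cross-attention and improves duration control with PM-RoPE. It is trained on 170,000 hours of multilingual speech and reports speaker similarity gains in Japanese and Korean, along with the lowest Japanese character error rate among the compared baselines.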
Contribution
It introduces an encoder-decoder codec language model with cross-attention at every decoder layer and Progress-Monitoring Rotary Position Embedding (PM-RoPE) for duration control, improving zero-shot multilingual TTS performance.
Findings
Achieves significant speaker similarity gains in Japanese and Korean.
Attains the lowest Japanese character error rate among baselines.
Disabling PM-RoPE severely degrades synthesis quality.
Abstract
Autoregressive neural codec language models have shown strong zero-shot voice cloning ability, but decoder-only architectures treat input text as a prefix that competes with the growing audio sequence for positional capacity, weakening text conditioning over long utterances. We present T5Gemma-TTS, an encoder-decoder codec language model that maintains persistent text conditioning by routing bidirectional text representations through cross-attention at every decoder layer. Built on the T5Gemma pretrained encoder-decoder backbone (2B encoder + 2B decoder; 4B parameters), it inherits rich linguistic knowledge without phoneme conversion and processes text directly at the subword level. To improve duration control, we introduce Progress-Monitoring Rotary Position Embedding (PM-RoPE) in all 26 cross-attention layers, injecting normalized progress signals that help the decoder track target…
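The abstract excerpt does not give the PM-RoPE formulation, so the following is only an illustrative sketch of the general idea of feeding normalized progress into rotary position embeddings for cross-attention. It is not the authors' implementation. The function names (rope_rotate, progress_positions, progress_cross_attention_scores), the scale value, and the choice to normalize both decoder steps and text tokens by their own sequence lengths are all assumptions made for illustration.

```python
import numpy as np


def rope_rotate(x, positions, base=10000.0):
    """Apply a standard rotary embedding to x (shape [n, d], d even).

    Each consecutive feature pair is rotated by positions * inv_freq.
    """
    n, d = x.shape
    assert d % 2 == 0, "feature dimension must be even"
    inv_freq = base ** (-np.arange(0, d, 2) / d)       # [d/2]
    angles = positions[:, None] * inv_freq[None, :]    # [n, d/2]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out


def progress_positions(length, scale=100.0):
    """Hypothetical helper: map indices 0..length-1 to progress in [0, 1), times scale."""
    return np.arange(length) / length * scale


def progress_cross_attention_scores(q, k, scale=100.0):
    """Cross-attention logits where positions are normalized progress (assumed design).

    q: decoder (audio) queries, shape [T, d]; k: encoder (text) keys, shape [N, d].
    Because rotary scores depend only on position differences, a query at
    progress p interacts with a key at progress p' through (p - p').
    """
    q_rot = rope_rotate(q, progress_positions(q.shape[0], scale))
    k_rot = rope_rotate(k, progress_positions(k.shape[0], scale))
    return q_rot @ k_rot.T / np.sqrt(q.shape[1])
```

Under these assumptions, a decoder step that is 50% of the way through the target audio is positionally aligned with the text token 50% of the way through the input, regardless of the absolute lengths. This is one plausible way a progress signal could help a decoder track how much of the utterance remains.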
