T5Gemma-TTS Technical Report

Chihiro Arata; Kiyoshi Kurihara

arXiv:2604.01760 · eess.AS · April 3, 2026

TL;DR

T5Gemma-TTS is an encoder-decoder neural codec TTS model that keeps text conditioning persistent through cross-attention and improves duration control with PM-RoPE. Trained on 170,000 hours of multilingual speech, it reports speaker similarity gains in Japanese and Korean and the lowest Japanese character error rate among the compared baselines.

Contribution

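The paper introduces an encoder-decoder codec language model in which every decoder layer cross-attends to bidirectional text representations, and adds Progress-Monitoring Rotary Position Embedding (PM-RoPE) to the cross-attention layers for duration control, targeting zero-shot multilingual TTS.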
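
To make "cross-attention at every layer" concrete, the following is a minimal toy sketch, not the paper's implementation. It omits self-attention, feed-forward blocks, normalization, and learned projections, and the names `cross_attend` and `decode_step` are hypothetical. It only illustrates that each decoder layer reads the same text encoder output directly, rather than seeing text once as a prefix.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, memory):
    """Single-head scaled dot-product attention of one query over encoder states."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in memory]
    weights = softmax(scores)
    dim = len(memory[0])
    # Weighted sum of the encoder states (used here as both keys and values).
    return [sum(w * vec[d] for w, vec in zip(weights, memory)) for d in range(dim)]


def decode_step(audio_state, text_memory, num_layers):
    """Toy decoder step: every layer re-reads the same text memory (assumed layout)."""
    hidden = list(audio_state)
    for _ in range(num_layers):
        context = cross_attend(hidden, text_memory)
        hidden = [h + c for h, c in zip(hidden, context)]  # residual connection
    return hidden


text_memory = [[1.0, 0.0], [0.0, 1.0]]  # two toy text-token representations
print(decode_step([0.5, 0.5], text_memory, num_layers=3))
```

Because `text_memory` is passed unchanged to every layer, the text signal does not have to survive as part of a growing audio sequence, which is the property the contribution statement emphasizes.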

Findings

01

Achieves significant speaker similarity gains in Japanese and Korean.

02

Attains the lowest Japanese character error rate among baselines.

03

Disabling PM-RoPE severely degrades synthesis quality.

Abstract

Autoregressive neural codec language models have shown strong zero-shot voice cloning ability, but decoder-only architectures treat input text as a prefix that competes with the growing audio sequence for positional capacity, weakening text conditioning over long utterances. We present T5Gemma-TTS, an encoder-decoder codec language model that maintains persistent text conditioning by routing bidirectional text representations through cross-attention at every decoder layer. Built on the T5Gemma pretrained encoder-decoder backbone (2B encoder + 2B decoder; 4B parameters), it inherits rich linguistic knowledge without phoneme conversion and processes text directly at the subword level. To improve duration control, we introduce Progress-Monitoring Rotary Position Embedding (PM-RoPE) in all 26 cross-attention layers, injecting normalized progress signals that help the decoder track target…
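
The abstract is cut off before it specifies how PM-RoPE is computed, so the sketch below is only one plausible reading of "normalized progress signals", stated as an assumption: text and audio indices are each rescaled to a shared progress range and then used as rotary positions for cross-attention keys and queries. The function names, the `scale` of 100, and the `base` of 10000 are illustrative choices, not values from the paper.

```python
import math


def progress_positions(length, scale=100.0):
    """Map indices 0..length-1 to normalized progress in [0, scale) (assumed scheme)."""
    return [scale * i / length for i in range(length)]


def rotate(vec, position, base=10000.0):
    """Standard rotary embedding: rotate each consecutive pair by a position-dependent angle."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def rope_score(query, q_pos, key, k_pos):
    """Dot product of rotated query and key; depends only on q_pos - k_pos."""
    q = rotate(query, q_pos)
    k = rotate(key, k_pos)
    return sum(a * b for a, b in zip(q, k))


text_pos = progress_positions(4)   # 4 text tokens
audio_pos = progress_positions(8)  # 8 target audio frames
v = [1.0, 0.0, 1.0, 0.0]           # identical toy content for query and keys

# Audio frame 4 is halfway through the utterance, as is text token 2.
scores = [rope_score(v, audio_pos[4], v, p) for p in text_pos]
print(scores.index(max(scores)))
```

Under this assumption, an audio frame and a text token at the same fraction of their respective sequences receive the same rotary position, so with equal content the attention score peaks where progress matches. That gives the decoder a positional cue tied to the target duration rather than to absolute frame counts.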

Code & Models

Repositories

Aratako/T5Gemma-TTS
github

Models

🤗
Aratako/T5Gemma-TTS-2b-2b
model · 627 downloads · ♡ 117
