IndexTTS 2.5 Technical Report
Yunpei Li, Xun Zhou, Jinchao Wang, Lu Wang, Yong Wu, Siyi Zhou, Yiquan Zhou, Jingchen Shu

TL;DR
IndexTTS 2.5 advances zero-shot multilingual emotional TTS by enhancing efficiency, quality, and language coverage through codec compression, architectural improvements, cross-lingual strategies, and reinforcement learning.
Contribution
The paper introduces IndexTTS 2.5, which improves on IndexTTS 2 in multilingual coverage, inference speed, and synthesis quality through new modeling strategies and optimization techniques.
Findings
Supports four languages with zero-shot emotion transfer.
Achieves a 2.28x inference speedup.
Maintains comparable word error rate (WER) and speaker similarity.
Abstract
In prior work, we introduced IndexTTS 2, a zero-shot neural text-to-speech foundation model comprising two core components: a transformer-based Text-to-Semantic (T2S) module and a non-autoregressive Semantic-to-Mel (S2M) module, which together enable faithful emotion replication and establish the first autoregressive duration-controllable generative paradigm. Building upon this, we present IndexTTS 2.5, which significantly enhances multilingual coverage, inference speed, and overall synthesis quality through four key improvements: 1) Semantic Codec Compression: we reduce the semantic codec frame rate from 50 Hz to 25 Hz, halving sequence length and substantially lowering both training and inference costs; 2) Architectural Upgrade: we replace the U-DiT-based backbone of the S2M module with a more efficient Zipformer-based modeling architecture, achieving notable parameter reduction and…
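As a rough illustration of the first improvement, the sketch below works through the arithmetic behind halving the semantic codec frame rate. Only the 50 Hz and 25 Hz figures come from the abstract; the helper name `num_semantic_frames` and the 10-second utterance are assumptions made for this example, not part of IndexTTS 2.5.

```python
import math


def num_semantic_frames(duration_s: float, frame_rate_hz: float) -> int:
    """Return how many codec frames are needed to cover `duration_s` seconds.

    Hypothetical helper for illustration only; it is not an IndexTTS API.
    """
    return math.ceil(duration_s * frame_rate_hz)


# Assumed utterance length, chosen only to make the numbers concrete.
duration_s = 10.0

frames_at_50hz = num_semantic_frames(duration_s, 50.0)  # frame rate before the change
frames_at_25hz = num_semantic_frames(duration_s, 25.0)  # frame rate after the change

print(frames_at_50hz, frames_at_25hz)  # 500 250
print(frames_at_50hz / frames_at_25hz)  # 2.0
```

For the same audio duration, the sequence the model has to generate and process is half as long at 25 Hz, which is where the reduction in training and inference cost described in the abstract comes from.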
Taxonomy
Topics: Emotion and Mood Recognition · Topic Modeling · Mental Health via Writing
