IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech

Siyi Zhou; Yiquan Zhou; Yi He; Xun Zhou; Jinchao Wang; Wei Deng; Jingchen Shu

arXiv:2506.21619 · cs.CL · September 4, 2025

TL;DR

IndexTTS2 is an autoregressive zero-shot TTS model that adds precise control over speech duration and disentangles emotional expression from speaker identity, targeting applications such as video dubbing that require strict audio-visual synchronization.

Contribution

The paper presents IndexTTS2, an autoregressive TTS framework that supports explicit duration control, disentangles emotion from speaker identity, and adds a soft instruction mechanism for improved zero-shot emotional speech synthesis.

Findings

01 Outperforms state-of-the-art zero-shot TTS models in key metrics.

02 Supports precise speech duration control in autoregressive generation.

03 Effectively disentangles emotion and speaker identity for flexible synthesis.

Abstract

Existing autoregressive large-scale text-to-speech (TTS) models have advantages in speech naturalness, but their token-by-token generation mechanism makes it difficult to precisely control the duration of synthesized speech. This becomes a significant limitation in applications requiring strict audio-visual synchronization, such as video dubbing. This paper introduces IndexTTS2, which proposes a novel, general, and autoregressive model-friendly method for speech duration control. The method supports two generation modes: one explicitly specifies the number of generated tokens to precisely control speech duration; the other freely generates speech in an autoregressive manner without specifying the number of tokens, while faithfully reproducing the prosodic features of the input prompt. Furthermore, IndexTTS2 achieves disentanglement between emotional expression and speaker identity,…
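The two generation modes described in the abstract can be illustrated with a toy decoding loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the token rate `TOKENS_PER_SECOND`, the `STOP_TOKEN` id, the filler token, and the `toy_step` stand-in for a model are all hypothetical. How IndexTTS2 conditions the model on the requested token count is not described in this excerpt; the loop below only enforces the budget from outside.

```python
import math
from typing import Callable, List, Optional

# Hypothetical values for illustration only; not taken from the paper.
TOKENS_PER_SECOND = 25  # assumed speech-token rate
STOP_TOKEN = -1         # assumed end-of-speech token id
FILLER_TOKEN = 0        # assumed placeholder used when a stop is suppressed


def tokens_for_duration(seconds: float, rate: float = TOKENS_PER_SECOND) -> int:
    """Convert a target duration into a token budget, rounded up."""
    return max(1, math.ceil(seconds * rate))


def generate(step: Callable[[List[int]], int],
             num_tokens: Optional[int] = None,
             max_tokens: int = 1000) -> List[int]:
    """Toy autoregressive loop with two modes.

    num_tokens given: fixed-length mode, emits exactly num_tokens tokens.
    num_tokens None:  free mode, stops when step() returns STOP_TOKEN
                      (or when max_tokens is reached).
    """
    out: List[int] = []
    limit = num_tokens if num_tokens is not None else max_tokens
    while len(out) < limit:
        tok = step(out)
        if tok == STOP_TOKEN:
            if num_tokens is None:
                break  # free mode: the model decides when to stop
            tok = FILLER_TOKEN  # fixed mode: suppress the early stop
        out.append(tok)
    return out


def toy_step(prefix: List[int]) -> int:
    """Stand-in for a model: wants to stop after 40 tokens."""
    return STOP_TOKEN if len(prefix) >= 40 else len(prefix) % 7


budget = tokens_for_duration(2.0)               # token budget for a 2-second clip
fixed = generate(toy_step, num_tokens=budget)   # duration-controlled mode
free = generate(toy_step)                       # free-running mode
```

In the fixed-length call the output length equals the budget regardless of when the toy model would have stopped, which is the property a dubbing pipeline needs; in the free call the length is whatever the model chooses.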

Code & Models

Datasets

echodict/index-tts (dataset, 149 downloads)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Face recognition and analysis