IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech
Siyi Zhou, Yiquan Zhou, Yi He, Xun Zhou, Jinchao Wang, Wei Deng, Jingchen Shu

TL;DR
IndexTTS2 is an autoregressive zero-shot TTS model that adds precise duration control and disentangles emotional expression from speaker identity, targeting applications such as video dubbing that require strict audio-visual synchronization.
Contribution
The paper presents IndexTTS2, an autoregressive TTS framework that supports explicit duration control, disentangles emotion from speaker identity, and adds a soft instruction mechanism for improved zero-shot emotional speech synthesis.
Findings
Outperforms state-of-the-art zero-shot TTS models in key metrics.
Supports precise speech duration control in autoregressive generation.
Effectively disentangles emotion and speaker identity for flexible synthesis.
Abstract
Existing autoregressive large-scale text-to-speech (TTS) models have advantages in speech naturalness, but their token-by-token generation mechanism makes it difficult to precisely control the duration of synthesized speech. This becomes a significant limitation in applications requiring strict audio-visual synchronization, such as video dubbing. This paper introduces IndexTTS2, which proposes a novel, general, and autoregressive model-friendly method for speech duration control. The method supports two generation modes: one explicitly specifies the number of generated tokens to precisely control speech duration; the other freely generates speech in an autoregressive manner without specifying the number of tokens, while faithfully reproducing the prosodic features of the input prompt. Furthermore, IndexTTS2 achieves disentanglement between emotional expression and speaker identity,…
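The two generation modes described in the abstract can be illustrated with a toy decoding loop. This is only a sketch of the calling contract (fixed token count versus free-running generation), not the paper's method: the token rate, the stop-token id, the vocabulary size, and every function name below are hypothetical, and a seeded random sampler stands in for the real model.

```python
import random

TOKENS_PER_SECOND = 25   # assumed token rate, not taken from the paper
STOP_TOKEN = 0           # hypothetical end-of-speech token id
VOCAB_SIZE = 100         # hypothetical vocabulary size


def tokens_for_duration(seconds, tokens_per_second=TOKENS_PER_SECOND):
    """Convert a target duration in seconds into a whole number of tokens."""
    if seconds <= 0:
        raise ValueError("duration must be positive")
    return max(1, round(seconds * tokens_per_second))


def make_toy_sampler(seed=0):
    """Return a stand-in for a model's next-token step (seeded random draws)."""
    rng = random.Random(seed)

    def sample(tokens_so_far, allow_stop):
        # When stopping is not allowed, the stop token is excluded from the draw.
        low = STOP_TOKEN if allow_stop else STOP_TOKEN + 1
        return rng.randint(low, VOCAB_SIZE - 1)

    return sample


def generate(sample, num_tokens=None, max_tokens=1000):
    """Toy autoregressive loop with two modes.

    num_tokens given: emit exactly that many tokens (fixed-duration mode).
    num_tokens None:  run until the sampler emits STOP_TOKEN or max_tokens
                      is reached (free mode).
    """
    fixed_length = num_tokens is not None
    limit = num_tokens if fixed_length else max_tokens
    tokens = []
    while len(tokens) < limit:
        token = sample(tokens, allow_stop=not fixed_length)
        if token == STOP_TOKEN:
            break  # only reachable in free mode
        tokens.append(token)
    return tokens


# Fixed-duration mode: request two seconds of speech at the assumed rate.
fixed = generate(make_toy_sampler(), num_tokens=tokens_for_duration(2.0))

# Free mode: the sampler decides when to stop.
free = generate(make_toy_sampler())
```

In the fixed mode the output length is determined before decoding starts, which is what makes it usable for synchronization tasks; in the free mode the length depends on when the stop token is sampled.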
Code & Models
- 🤗 IndexTeam/IndexTTS-2 (model) · 18k dl · ♡ 667
- 🤗 Toxzic/indextts-colab (model)
- 🤗 garyswansrs/index_tts_2_vllm (model) · 2 dl
- 🤗 Jmica/IndexTTS2 (model) · 13 dl
- 🤗 Kriest/IndexTTS2 (model)
- 🤗 dinhthuan/index-tts-2-vietnamese (model) · 100 dl · ♡ 20
- 🤗 jaman21/IndexTTS-2 (model) · 2 dl
- 🤗 Pragmaticl/index-tts-2-vietnamese-model (model) · 6 dl
- 🤗 Pragmaticl/indextts-2-model (model) · 1 dl
- 🤗 litagin/IndexTTS-2-duplicated (model)
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Face recognition and analysis
