Qwen3-TTS Technical Report
Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang, Zhifang Guo, Ziyue Jiang, Hongkun Hao, Zishan Guo, Xinyu Zhang, Pei Zhang, Baosong Yang, Jin Xu, Jingren Zhou, Junyang Lin

TL;DR
Qwen3-TTS is a versatile, multilingual text-to-speech system capable of high-quality voice cloning, fine-grained control, and real-time streaming, trained on extensive speech data across multiple languages.
Contribution
This work introduces the Qwen3-TTS series, built on a dual-tokenizer architecture that enables state-of-the-art multilingual TTS, voice cloning, and ultra-low-latency streaming. The models are released as open source.
Findings
Achieves state-of-the-art results on multiple TTS benchmarks.
Supports 3-second voice cloning and description-based control.
Enables real-time streaming with minimal latency.
Abstract
In this report, we present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control, allowing both the creation of entirely novel voices and fine-grained manipulation over the output speech. Trained on over 5 million hours of speech data spanning 10 languages, Qwen3-TTS adopts a dual-track LM architecture for real-time synthesis, coupled with two speech tokenizers: 1) Qwen-TTS-Tokenizer-25Hz is a single-codebook codec emphasizing semantic content, which offers seamless integration with Qwen-Audio and enables streaming waveform reconstruction via a block-wise DiT. 2) Qwen-TTS-Tokenizer-12Hz achieves extreme bitrate reduction and ultra-low-latency streaming, enabling immediate first-packet emission through its 12.5 Hz,…
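To make the two token rates in the abstract concrete, the following is a minimal arithmetic sketch, not code from the report. It converts a frame rate into the span of audio covered by one token frame and counts the frames in a 3-second reference clip. The 12.5 Hz rate is stated in the abstract; the 25 Hz rate is an assumption read off the tokenizer's name, and the helper names `frame_duration_ms` and `frames_for_clip` are hypothetical.

```python
# Illustrative arithmetic only; helper names are hypothetical, not part of Qwen3-TTS.

def frame_duration_ms(frame_rate_hz: float) -> float:
    """Milliseconds of audio represented by a single token frame."""
    return 1000.0 / frame_rate_hz


def frames_for_clip(duration_s: float, frame_rate_hz: float) -> float:
    """Number of token frames needed to cover a clip of the given length."""
    return duration_s * frame_rate_hz


# 12.5 Hz comes from the abstract; 25 Hz is assumed from the tokenizer name.
RATES_HZ = {
    "Qwen-TTS-Tokenizer-25Hz": 25.0,
    "Qwen-TTS-Tokenizer-12Hz": 12.5,
}

if __name__ == "__main__":
    clip_seconds = 3.0  # length of the voice-cloning reference mentioned in the abstract
    for name, rate in RATES_HZ.items():
        print(
            f"{name}: {frame_duration_ms(rate):.0f} ms per frame, "
            f"{frames_for_clip(clip_seconds, rate):.1f} frames in {clip_seconds:.0f} s"
        )
```

The per-frame span is only the amount of audio one frame represents. It is not the first-packet latency reported for the system, which also depends on model and decoder compute.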
Code & Models
- 🤗 Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice (model) · 968k downloads · ♡ 1362
- 🤗 Qwen/Qwen3-TTS-12Hz-1.7B-Base (model) · 1.6M downloads · ♡ 361
- 🤗 Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign (model) · 639k downloads · ♡ 308
- 🤗 Qwen/Qwen3-TTS-12Hz-0.6B-Base (model) · 508k downloads · ♡ 210
- 🤗 Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice (model) · 224k downloads · ♡ 134
- 🤗 Qwen/Qwen3-TTS-Tokenizer-12Hz (model) · 63k downloads · ♡ 55
- 🤗 xkos/Qwen3-TTS-12Hz-1.7B-ONNX (model) · 79 downloads · ♡ 6
- 🤗 Accordic/qwen3-tts-12hz-1-7b-customvoice (model) · 2 downloads
- 🤗 Accordic/qwen3-tts-12hz-1-7b-base-model (model) · 2 downloads
- 🤗 Accordic/qwen3-tts-12hz-1-7b-voicedesign-model (model)
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques
