CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens
Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhifu Gao, Zhijie Yan

TL;DR
CosyVoice introduces a scalable zero-shot multilingual TTS system using supervised semantic tokens derived from speech recognition models, significantly improving content accuracy and speaker similarity in voice cloning tasks.
Contribution
This paper presents the novel use of supervised semantic tokens in TTS, enhancing zero-shot synthesis and scalability compared to previous unsupervised token approaches.
Findings
Supervised semantic tokens outperform unsupervised tokens in content accuracy and speaker similarity.
Utilizing large-scale data further improves synthesis quality.
First integration of supervised speech tokens into TTS models.
Abstract
Recent years have witnessed a trend that large language model (LLM) based text-to-speech (TTS) emerges into the mainstream due to their high naturalness and zero-shot capacity. In this paradigm, speech signals are discretized into token sequences, which are modeled by an LLM with text as prompts and reconstructed by a token-based vocoder to waveforms. Obviously, speech tokens play a critical role in LLM-based TTS models. Current speech tokens are learned in an unsupervised manner, which lacks explicit semantic information and alignment to the text. In this paper, we propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder. Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a…
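As a rough illustration of the quantization step the abstract mentions (turning continuous encoder frames into discrete speech tokens), here is a minimal sketch of nearest-neighbor vector quantization. It is not the authors' implementation: the function name `quantize`, the toy dimensions, and the random codebook and frames are all hypothetical, and a real system would learn the codebook jointly with the speech recognition encoder.

```python
import random


def quantize(frames, codebook):
    """Map each frame (a list of floats) to the index of its nearest codebook vector."""
    tokens = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every codebook entry.
        distances = [
            sum((x - c) ** 2 for x, c in zip(frame, code))
            for code in codebook
        ]
        # The token is the index of the closest entry.
        tokens.append(distances.index(min(distances)))
    return tokens


# Hypothetical toy sizes, chosen only for illustration.
random.seed(0)
DIM, CODEBOOK_SIZE, NUM_FRAMES = 4, 8, 6
codebook = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(CODEBOOK_SIZE)]
encoder_frames = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_FRAMES)]

speech_tokens = quantize(encoder_frames, codebook)
print(speech_tokens)  # one integer token per encoder frame
```

In the paradigm the abstract outlines, a sequence like `speech_tokens` is what the LLM would model given text as a prompt, and what a token-based vocoder would turn back into a waveform.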
Code & Models
- 🤗 FunAudioLLM/Fun-CosyVoice3-0.5B-2512 · model · 6.5k dl · ♡ 499
- 🤗 FunAudioLLM/CosyVoice2-0.5B · model · 2.6k dl · ♡ 65
- 🤗 MediaTek-Research/BreezyVoice · model · ♡ 52
- 🤗 FunAudioLLM/CosyVoice-300M · model · 556 dl · ♡ 7
- 🤗 FunAudioLLM/CosyVoice-300M-SFT · model · 381 dl · ♡ 4
- 🤗 FunAudioLLM/CosyVoice-300M-Instruct · model · 275 dl · ♡ 11
- 🤗 lucyknada/CosyVoice2-0.5B · model
- 🤗 mrfakename/CosyVoice2-0.5B · model · ♡ 1
- 🤗 FunAudioLLM/CosyVoice-ttsfrd · model · ♡ 4
- 🤗 gpustack/CosyVoice2-0.5B · model · 128 dl · ♡ 1
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Video Analysis and Summarization
