CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training
Zhihao Du, Changfeng Gao, Yuxuan Wang, Fan Yu, Tianyu Zhao, Hao Wang, Xiang Lv, Hui Wang, Chongjia Ni, Xian Shi, Keyu An, Guanrou Yang, Yabin Li, Yanni Chen, Zhifu Gao, Qian Chen, Yue Gu, Mengzhe Chen, Yafeng Chen, Shiliang Zhang, Wen Wang, Jieping Ye

TL;DR
CosyVoice 3 advances in-the-wild multilingual speech synthesis by scaling data and model size and by introducing a new speech tokenizer and a differentiable reward model, improving naturalness, content consistency, and speaker similarity.
Contribution
The paper presents CosyVoice 3, a multilingual speech synthesis model with a new speech tokenizer and a post-training reward model, scaled up in both training data and model size relative to CosyVoice 2.
Findings
Enhanced speech naturalness and prosody in multilingual synthesis.
Improved content consistency and speaker similarity.
Successful scaling of data and model size for better performance.
Abstract
In our prior works, we introduced a scalable streaming speech synthesis model, CosyVoice 2, which integrates a large language model (LLM) and a chunk-aware flow matching (FM) model, and achieves low-latency bi-streaming speech synthesis and human-parity quality. Despite these advancements, CosyVoice 2 exhibits limitations in language coverage, domain diversity, data volume, text formats, and post-training techniques. In this paper, we present CosyVoice 3, an improved model designed for zero-shot multilingual speech synthesis in the wild, surpassing its predecessor in content consistency, speaker similarity, and prosody naturalness. Key features of CosyVoice 3 include: 1) A novel speech tokenizer to improve prosody naturalness, developed via supervised multi-task training, including automatic speech recognition, speech emotion recognition, language identification, audio event detection,…
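The abstract names a flow matching (FM) model but does not spell out its objective here. As a rough illustration only, the sketch below shows the generic conditional flow matching recipe with a linear path between a noise sample and a data sample. The linear path, the function names (`fm_training_pair`, `fm_loss`, `euler_sample`), and the Euler sampler are assumptions made for this example; they are not taken from the paper and do not model its chunk-aware design.

```python
def fm_training_pair(x0, x1, t):
    """Build one training example for the linear path x_t = (1 - t) * x0 + t * x1.

    x0: noise sample, x1: data sample (for instance one acoustic feature frame),
    t: time in [0, 1]. Returns the interpolated point and the target velocity,
    which for this path is the constant x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    return x_t, target


def fm_loss(predicted_velocity, target_velocity):
    """Mean squared error between a predicted and a target velocity vector."""
    n = len(target_velocity)
    return sum((p - q) ** 2 for p, q in zip(predicted_velocity, target_velocity)) / n


def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In a real system `velocity_fn` would be a trained neural network conditioned on speech tokens and speaker information. For a quick sanity check it can be replaced by an oracle such as `lambda x, t: [(b - a) / (1.0 - t) for a, b in zip(x, x1)]`, which moves any starting point to a known `x1`.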
Code & Models
- 🤗 FunAudioLLM/Fun-CosyVoice3-0.5B-2512 · model · 6.5k dl · ♡ 499
- 🤗 FunAudioLLM/CosyVoice2-0.5B · model · 2.6k dl · ♡ 65
- 🤗 FunAudioLLM/CosyVoice-300M · model · 556 dl · ♡ 7
- 🤗 FunAudioLLM/CosyVoice-300M-SFT · model · 381 dl · ♡ 4
- 🤗 FunAudioLLM/CosyVoice-300M-Instruct · model · 275 dl · ♡ 11
- 🤗 FunAudioLLM/CosyVoice-ttsfrd · model · ♡ 4
- 🤗 o6Dool/CossyVoice2_vietnamese_fintune · model
- 🤗 Translsis/fun-cosyvoice3-0.5b-2512-model · model · 2 dl · ♡ 1
- 🤗 Translsis/cosyvoice-ttsfrd-model · model
- 🤗 Pragmaticl/fun-cosyvoice3-0.5b-2512-model · model · 1 dl
Taxonomy
Topics: Speech Recognition and Synthesis
