CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

Zhihao Du; Changfeng Gao; Yuxuan Wang; Fan Yu; Tianyu Zhao; Hao Wang; Xiang Lv; Hui Wang; Chongjia Ni; Xian Shi; Keyu An; Guanrou Yang; Yabin Li; Yanni Chen; Zhifu Gao; Qian Chen; Yue Gu; Mengzhe Chen; Yafeng Chen; Shiliang Zhang; Wen Wang; Jieping Ye

arXiv:2505.17589·cs.SD·May 28, 2025

TL;DR

CosyVoice 3 advances in-the-wild multilingual speech synthesis by scaling data and model size and by introducing a new speech tokenizer and a differentiable reward model, improving prosody naturalness, content consistency, and speaker similarity.

Contribution

The paper presents CosyVoice 3, a multilingual speech synthesis model with a new speech tokenizer and a post-training reward model, scaled up in both training data and model size relative to its predecessor, CosyVoice 2.

Findings

01 Enhanced speech naturalness and prosody in multilingual synthesis.

02 Improved content consistency and speaker similarity.

03 Successful scaling of data and model size for better performance.

Abstract

In our prior work, we introduced a scalable streaming speech synthesis model, CosyVoice 2, which integrates a large language model (LLM) and a chunk-aware flow matching (FM) model and achieves low-latency bi-streaming speech synthesis with human-parity quality. Despite these advancements, CosyVoice 2 exhibits limitations in language coverage, domain diversity, data volume, text formats, and post-training techniques. In this paper, we present CosyVoice 3, an improved model designed for zero-shot multilingual speech synthesis in the wild, surpassing its predecessor in content consistency, speaker similarity, and prosody naturalness. Key features of CosyVoice 3 include: 1) A novel speech tokenizer to improve prosody naturalness, developed via supervised multi-task training, including automatic speech recognition, speech emotion recognition, language identification, audio event detection,…

Taxonomy

Topics: Speech Recognition and Synthesis