CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Zhihao Du; Qian Chen; Shiliang Zhang; Kai Hu; Heng Lu; Yexin Yang; Hangrui Hu; Siqi Zheng; Yue Gu; Ziyang Ma; Zhifu Gao; Zhijie Yan

arXiv:2407.05407 · cs.SD · July 10, 2024 · 3 cites

Open Access · 1 Repo · 10 Models · 2 Datasets

TL;DR

CosyVoice introduces a scalable zero-shot multilingual TTS system using supervised semantic tokens derived from speech recognition models, significantly improving content accuracy and speaker similarity in voice cloning tasks.

Contribution

This paper presents the novel use of supervised semantic tokens in TTS, enhancing zero-shot synthesis and scalability compared to previous unsupervised token approaches.

Findings

1. Supervised semantic tokens outperform unsupervised tokens in content consistency and speaker similarity for zero-shot voice cloning.

2. Training on large-scale data further improves synthesis quality.

3. To the authors' knowledge, this is the first attempt to incorporate supervised speech tokens into TTS models.

Abstract

In recent years, large language model (LLM) based text-to-speech (TTS) has emerged into the mainstream thanks to its high naturalness and zero-shot capacity. In this paradigm, speech signals are discretized into token sequences, which are modeled by an LLM with text as prompts and reconstructed into waveforms by a token-based vocoder. Speech tokens therefore play a critical role in LLM-based TTS models. Current speech tokens are learned in an unsupervised manner and so lack explicit semantic information and alignment to the text. In this paper, we propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder. Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a…
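The abstract describes deriving speech tokens by inserting vector quantization into a speech recognition encoder. As a rough illustration of that one step, the sketch below maps each encoder frame to the index of its nearest codebook vector. It is a toy example under stated assumptions, not the CosyVoice implementation: the function names, the two-dimensional frames, and the three-entry codebook are all hypothetical, and a real system would learn the codebook jointly with the encoder.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each frame to the index of its nearest codebook vector.

    The returned indices play the role of discrete speech tokens.
    """
    tokens = []
    for frame in frames:
        distances = [squared_distance(frame, code) for code in codebook]
        # index() returns the first minimum, so ties go to the lowest index.
        tokens.append(distances.index(min(distances)))
    return tokens


# Hypothetical codebook and encoder outputs, chosen only for illustration.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7], [0.4, 0.3]]

tokens = quantize(frames, codebook)
print(tokens)  # [0, 1, 2, 0]
```

In an LLM-based TTS pipeline of the kind the abstract outlines, a token sequence like this would be what the language model is trained to predict from text.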

Code & Models

Repositories

funaudiollm/cosyvoice
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Video Analysis and Summarization