Continuous Speech Tokenizer in Text To Speech
Yixing Li, Ruobing Xie, Xingwu Sun, Yu Cheng, Zhanhui Kang

TL;DR
This paper introduces Cont-SPT, a continuous speech tokenizer for text-to-speech systems that preserves more information than discrete tokenizers, leading to improved speech quality and continuity.
Contribution
The paper proposes a novel continuous speech tokenizer, Cont-SPT, which reduces information loss in speech representation for TTS applications.
Findings
Cont-SPT achieves higher estimated Mean Opinion Scores (MoS) than a discrete speech tokenizer.
Cont-SPT preserves more information across the frequency spectrum, at both low and high frequencies.
The approach improves speech continuity in TTS models.
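The frequency-spectrum finding above can be made concrete with a small sketch. This is not the metric used in the paper, which this summary does not specify. It is an illustrative, hypothetical per-band score (the function name `band_preservation` and the band count are assumptions) that compares the magnitude spectrum of an original waveform with that of a reconstruction.

```python
import numpy as np

def band_preservation(original, reconstructed, n_bands=4):
    """Hypothetical per-band preservation score, ordered low to high frequency.

    A score near 1.0 means the band's magnitude spectrum is almost unchanged;
    near 0.0 means the band's content was lost. Scores can go below 0 if the
    reconstruction adds spurious energy. Phase is ignored in this sketch.
    """
    orig_spec = np.abs(np.fft.rfft(original))
    rec_spec = np.abs(np.fft.rfft(reconstructed))
    # Split the frequency bins into contiguous bands, low to high.
    orig_bands = np.array_split(orig_spec, n_bands)
    rec_bands = np.array_split(rec_spec, n_bands)
    scores = []
    for o, r in zip(orig_bands, rec_bands):
        err = np.sum((o - r) ** 2)
        ref = np.sum(o ** 2) + 1e-12  # guard against empty bands
        scores.append(1.0 - err / ref)
    return np.array(scores)

# Toy signal: one low tone (200 Hz) plus one high tone (6000 Hz) at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
low = np.sin(2 * np.pi * 200 * t)
high = 0.5 * np.sin(2 * np.pi * 6000 * t)

# A "lossy" reconstruction that keeps only the low tone.
scores = band_preservation(low + high, low, n_bands=4)
print(scores)
```

In this toy case the band containing the dropped 6000 Hz tone scores close to zero while the lowest band stays close to one, which is the kind of per-band gap a comparison between tokenizers would look for.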
Abstract
The fusion of speech and language in the era of large language models has garnered significant attention. Discrete speech tokens are often used in text-to-speech tasks for speech compression and portability; they are convenient for joint training with text and offer good compression efficiency. However, we found that the discrete speech tokenizer still suffers from information loss. Therefore, we propose a simple yet effective continuous speech tokenizer named Cont-SPT, and a text-to-speech model based on continuous speech tokens. Our results show that the speech language model based on the continuous speech tokenizer has better continuity and higher estimated Mean Opinion Scores (MoS). This enhancement is attributed to the better information preservation rate of the continuous speech tokenizer across both low and high frequencies in the frequency domain. The code and resources for…
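To illustrate the information loss the abstract attributes to discrete tokenizers, here is a minimal sketch under stated assumptions. It assumes a discrete tokenizer that maps each frame to its nearest codebook entry (a common vector-quantization scheme, not necessarily the baseline used in the paper) and contrasts it with simply keeping the continuous vector. The frame data, codebook size, and function name `quantize` are hypothetical, and this is not the Cont-SPT architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 8))    # stand-in for per-frame encoder outputs
codebook = rng.normal(size=(16, 8))   # hypothetical 16-entry codebook

def quantize(frames, codebook):
    """Map each frame to the index and vector of its nearest codebook entry."""
    # Squared Euclidean distances, shape (n_frames, n_codes).
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    ids = dists.argmin(axis=1)
    return ids, codebook[ids]

# Discrete token: an integer id; the frame is replaced by its codebook vector.
ids, discrete = quantize(frames, codebook)
# Continuous token: the vector itself is passed on unchanged.
continuous = frames

discrete_err = np.mean((frames - discrete) ** 2)
continuous_err = np.mean((frames - continuous) ** 2)
print(discrete_err, continuous_err)
```

The discrete path incurs a nonzero reconstruction error because many distinct frames collapse onto the same codebook entry, while the continuous path carries the frame through without that quantization step.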
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
