Vec-Tok Speech: speech vectorization and tokenization for neural speech generation
Xinfa Zhu, Yuanjun Lv, Yi Lei, Tao Li, Wendi He, Hongbin Zhou, Heng Lu, Lei Xie

TL;DR
Vec-Tok Speech introduces a novel speech codec combining speech vectors and semantic tokens, enabling high-fidelity, expressive speech generation and versatile tasks like voice conversion, style transfer, and denoising with improved performance over state-of-the-art models.
Contribution
The paper presents a new speech codec and framework that leverages language models for diverse speech generation tasks, enhancing quality and task generalization.
Findings
Built on 50k hours of speech data, outperforms other state-of-the-art models.
Enables zero-shot voice conversion and style transfer.
Improves speech quality and task versatility.
Abstract
Language models (LMs) have recently flourished in natural language processing and computer vision, generating high-fidelity texts or images in various tasks. In contrast, the current speech generative models are still struggling regarding speech quality and task generalization. This paper presents Vec-Tok Speech, an extensible framework that resembles multiple speech generation tasks, generating expressive and high-fidelity speech. Specifically, we propose a novel speech codec based on speech vectors and semantic tokens. Speech vectors contain acoustic details contributing to high-fidelity speech reconstruction, while semantic tokens focus on the linguistic content of speech, facilitating language modeling. Based on the proposed speech codec, Vec-Tok Speech leverages an LM to undertake the core of speech generation. Moreover, Byte-Pair Encoding (BPE) is introduced to reduce the token…
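The abstract mentions Byte-Pair Encoding applied to tokens. As a rough illustration of the general technique only, the sketch below runs a few BPE merge steps over a sequence of integer token IDs: at each step the most frequent adjacent pair is replaced by a new ID, which shortens the sequence. The function names, the toy token sequence, and the starting ID of 100 are all hypothetical; this is not the paper's implementation, and the paper's actual vocabulary and merge settings are not stated here.

```python
from collections import Counter


def most_frequent_pair(seq):
    """Return the most frequent adjacent pair in seq, or None if there is no pair."""
    pairs = Counter(zip(seq, seq[1:]))
    if not pairs:
        return None
    return pairs.most_common(1)[0][0]


def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of pair in seq with new_id."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def bpe_compress(seq, num_merges, first_new_id):
    """Apply up to num_merges BPE merges; return the shortened sequence and the merge table."""
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(seq)
        if pair is None:
            break
        new_id = first_new_id + step
        merges[pair] = new_id
        seq = merge_pair(seq, pair, new_id)
    return seq, merges


# Toy token sequence (hypothetical IDs, standing in for semantic tokens).
tokens = [5, 7, 5, 7, 5, 7, 2, 2, 9]
compressed, merges = bpe_compress(tokens, num_merges=2, first_new_id=100)
print(len(tokens), len(compressed))
```

In this toy run the pair (5, 7) is merged first, then the pair of the resulting merged IDs, so the sequence an LM would have to model becomes shorter while the merge table allows the original IDs to be recovered.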
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling
