Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation

Hanzhao Li; Liumeng Xue; Haohan Guo; Xinfa Zhu; Yuanjun Lv; Lei Xie; Yunlin Chen; Hao Yin; Zhifei Li

arXiv:2406.07422·eess.AS·June 12, 2024·Interspeech

Open Access

TL;DR

Single-Codec introduces a single-codebook speech codec that improves efficiency and quality in speech generation by disentangling speech representations and enhancing encoding with contextual and resampling modules.

Contribution

It proposes a novel single-codebook, single-sequence speech codec using disentangled VQ-VAE and advanced encoder modules, outperforming multi-codebook codecs in quality and bandwidth.
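To make "single-codebook, single-sequence" concrete: each encoded frame is mapped to exactly one index in one codebook, so an utterance becomes one token sequence. The sketch below shows generic nearest-neighbour vector quantization in plain Python. It is not the paper's implementation; the function name `quantize`, the toy 2-D codebook, and the frame values are all hypothetical and chosen only for illustration.

```python
def quantize(frames, codebook):
    """Map each frame vector to the index of its nearest codebook entry.

    Generic vector-quantization lookup (squared Euclidean distance);
    an illustrative sketch, not the Single-Codec encoder itself.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


# Hypothetical toy codebook with 4 entries of dimension 2.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical encoder outputs, one vector per frame.
frames = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.9]]

# One index per frame: a single discrete sequence.
tokens = quantize(frames, codebook)
print(tokens)
```

A multi-codebook codec would instead emit several indices per frame, which is the multi-sequence prediction problem the paper sets out to avoid.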

Findings

01 Higher reconstruction quality than multi-codebook codecs

02 Lower bandwidth of only 304 bps

03 Improved naturalness and intelligibility in LLM-TTS experiments

Abstract

The multi-codebook speech codec enables the application of large language models (LLMs) in TTS but bottlenecks efficiency and robustness due to multi-sequence prediction. To avoid this obstacle, we propose Single-Codec, a single-codebook single-sequence codec, which employs a disentangled VQ-VAE to decouple speech into a time-invariant embedding and a phonetically-rich discrete sequence. Furthermore, the encoder is enhanced with 1) contextual modeling with a BLSTM module to exploit the temporal information, 2) a hybrid sampling module to alleviate distortion from upsampling and downsampling, and 3) a resampling module to encourage discrete units to carry more phonetic information. Compared with multi-codebook codecs, e.g., EnCodec and TiCodec, Single-Codec demonstrates higher reconstruction quality with a lower bandwidth of only 304 bps. The effectiveness of Single-Codec is further…
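For context on the bandwidth figure: the bitrate of a discrete codec is the token frame rate times the number of codebooks times the bits per index (log2 of the codebook size). The abstract does not give the frame rate or codebook size, so every configuration value in the sketch below is an assumption, picked only because it lands close to the reported 304 bps. The helper name `codec_bitrate_bps` is likewise hypothetical.

```python
import math


def codec_bitrate_bps(frame_rate_hz, codebook_size, num_codebooks=1):
    """Bits per second for a codec emitting `num_codebooks` indices per frame."""
    return frame_rate_hz * num_codebooks * math.log2(codebook_size)


# ASSUMED configuration (not stated in the abstract): illustration only.
sample_rate_hz = 24_000   # assumed audio sample rate
hop_length = 256          # assumed spectrogram hop, in samples
downsample_factor = 4     # assumed extra temporal downsampling in the encoder
codebook_size = 8192      # assumed codebook size, i.e. 13 bits per token

frame_rate_hz = sample_rate_hz / hop_length / downsample_factor
single = codec_bitrate_bps(frame_rate_hz, codebook_size)

# The same frame rate with a hypothetical 8-codebook design costs 8x the bits.
multi = codec_bitrate_bps(frame_rate_hz, codebook_size, num_codebooks=8)

print(f"frame rate: {frame_rate_hz} Hz")
print(f"single codebook: {single} bps, eight codebooks: {multi} bps")
```

Under these assumed values the single-codebook bitrate comes out near 305 bps. The point is only that one index per frame keeps the bitrate, and the length of the sequence an LLM must predict, low.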

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems

Methods: VQ-VAE