Addressing Index Collapse of Large-Codebook Speech Tokenizer with Dual-Decoding Product-Quantized Variational Auto-Encoder

Haohan Guo; Fenglong Xie; Dongchao Yang; Hui Lu; Xixin Wu; Helen Meng

arXiv:2406.02940·cs.SD·June 6, 2024

TL;DR

This paper introduces PQ-VAE, a speech tokenizer that uses multiple codebooks with fewer codewords each and composes them into a larger effective codebook. This prevents index collapse, makes large codebooks usable, and improves speech synthesis quality.

Contribution

It proposes a product-quantized VAE with dual-decoding training to effectively utilize large codebooks and address index collapse in speech tokenization.

Findings

01. PQ-VAE effectively prevents index collapse in large codebooks.

02. The dual-decoding training strategy improves codebook perplexity and reconstruction quality.

03. PQ-VAE enhances speech generation quality in language-model-based TTS.
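
To make "index collapse" and "codebook perplexity" concrete, here is a minimal sketch. It assumes the common definition of perplexity as the exponential of the entropy of empirical codeword usage; the page does not state the paper's exact formula, and the function name and toy index sequences are hypothetical.

```python
import math
from collections import Counter


def codebook_perplexity(indices):
    """Return exp(entropy) of the empirical codeword usage distribution.

    Assumed definition: a codebook whose K codewords are used equally
    often scores K; a collapsed codebook scores close to 1.
    """
    counts = Counter(indices)
    total = len(indices)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy)


# Toy token sequences over a 4-entry codebook (illustrative data only).
uniform = [0, 1, 2, 3] * 25          # every codeword used equally often
collapsed = [0] * 97 + [1, 2, 3]     # one codeword dominates

print(codebook_perplexity(uniform))
print(codebook_perplexity(collapsed))
```

Under this definition, the evenly used sequence reaches the codebook size, while the collapsed sequence stays near 1 even though all four indices technically appear.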

Abstract

VQ-VAE, a mainstream approach to speech tokenization, suffers from "index collapse", where only a small number of codewords are activated in large codebooks. This work proposes a product-quantized (PQ) VAE with more codebooks but fewer codewords to address this problem and build large-codebook speech tokenizers. It encodes speech features into multiple VQ subspaces and composes them into codewords in a larger codebook. To make full use of each VQ subspace, we also enhance PQ-VAE with a dual-decoding training strategy that uses both the encoding and the quantized sequences. The experimental results demonstrate that PQ-VAE addresses "index collapse" effectively, especially for larger codebooks. The model with the proposed training strategy further improves codebook perplexity and reconstruction quality, outperforming other multi-codebook VQ approaches. Finally, PQ-VAE demonstrates its…
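
The subspace encoding described in the abstract can be illustrated with a generic product-quantization sketch. This is not the authors' implementation: the equal-width split of the feature vector, the nearest-neighbour lookup, the mixed-radix index composition, and all names and toy values are assumptions made for illustration.

```python
def nearest(codebook, vec):
    """Index of the codeword with the smallest squared Euclidean distance to vec."""
    def sqdist(codeword):
        return sum((a - b) ** 2 for a, b in zip(codeword, vec))
    return min(range(len(codebook)), key=lambda k: sqdist(codebook[k]))


def product_quantize(vec, codebooks):
    """Split vec into len(codebooks) equal chunks and quantize each chunk
    against its own small codebook (assumed equal-width subspaces)."""
    m = len(codebooks)
    d = len(vec) // m
    return [nearest(cb, vec[i * d:(i + 1) * d]) for i, cb in enumerate(codebooks)]


def compose_index(sub_indices, k):
    """Combine per-subspace indices into a single index in a virtual
    codebook of size k ** len(sub_indices) (mixed-radix encoding)."""
    idx = 0
    for s in sub_indices:
        idx = idx * k + s
    return idx


# Two hypothetical subspaces, each with K = 2 two-dimensional codewords.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
vec = [0.9, 0.8, 0.1, 0.7]

sub = product_quantize(vec, codebooks)   # one index per subspace
token = compose_index(sub, k=2)          # index in a virtual 2 ** 2 = 4 entry codebook
print(sub, token)
```

With M codebooks of K codewords each, only M × K vectors are learned, yet the composed token ranges over K^M values, which is how a small per-subspace codebook can back a large effective codebook.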

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques