RepCodec: A Speech Representation Codec for Speech Tokenization

Zhichao Huang; Chutong Meng; Tom Ko

arXiv:2309.00169 · eess.AS · July 23, 2024

TL;DR

RepCodec is a novel speech representation codec that improves semantic speech tokenization by better preserving information, leading to enhanced performance in speech understanding and generation across multiple languages.

Contribution

It introduces a new codec that learns a vector quantization codebook from speech representations, outperforming traditional clustering methods in speech tokenization.

Findings

01. Significantly outperforms k-means clustering in speech tasks.

02. Effective across various speech encoders and languages.

03. Enhances information retention in speech tokenization.

Abstract

With recent rapid growth of large language models (LLMs), discrete speech tokenization has played an important role for injecting speech into LLMs. However, this discretization gives rise to a loss of information, consequently impairing overall performance. To improve the performance of these discrete speech tokens, we present RepCodec, a novel speech representation codec for semantic speech tokenization. In contrast to audio codecs which reconstruct the raw audio, RepCodec learns a vector quantization codebook through reconstructing speech representations from speech encoders like HuBERT or data2vec. Together, the speech encoder, the codec encoder and the vector quantization codebook form a pipeline for converting speech waveforms into semantic tokens. The extensive experiments illustrate that RepCodec, by virtue of its enhanced information retention capacity, significantly outperforms…
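The final step of the pipeline described above, mapping each frame of a speech representation to a discrete token through a codebook, can be sketched as a nearest-neighbour lookup. This is a minimal illustration and not the authors' implementation: the function names `quantize` and `reconstruction_error`, the toy codebook, and the toy frames are all hypothetical, and the learned codec encoder that RepCodec applies before quantization is omitted, so the frames stand in for its output.

```python
import numpy as np


def quantize(frames: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Return, for each frame (T, D), the index of the nearest code in codebook (K, D)."""
    # Pairwise squared Euclidean distances between frames and codes, shape (T, K).
    diffs = frames[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    return np.argmin(dists, axis=1)


def reconstruction_error(frames: np.ndarray, codebook: np.ndarray, tokens: np.ndarray) -> float:
    """Mean squared error between the frames and the codes selected for them."""
    return float(np.mean((frames - codebook[tokens]) ** 2))


# Hypothetical toy data: 3 codes and 4 frames, each of dimension 2.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
frames = np.array([[0.1, -0.1], [0.9, 1.2], [-1.1, 1.8], [0.8, 0.9]])

tokens = quantize(frames, codebook)
error = reconstruction_error(frames, codebook, tokens)
print(tokens.tolist(), error)
```

The same lookup is how k-means assigns a frame to a cluster. According to the abstract, the difference in RepCodec lies in how the codebook is obtained: it is learned by reconstructing the speech representations rather than by clustering them.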

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 6 · marginally above the acceptance threshold · Confidence 4

Strengths

The paper is clear and well organized. The proposed approach is simple and effective, and the comprehensive experiments further strengthen the paper's credibility. Based on these positive aspects, it is recommended for publication at the conference.

Weaknesses

The simplicity and effectiveness of the proposed approach are commendable. While there are no significant weaknesses to highlight, it would be intriguing to see the application of RepCodec in the context of zero-shot Text-to-Speech (TTS) systems, such as Vall-E. Exploring its potential in this domain could provide valuable insights and possibly further advancements in speech-processing technology. The idea of SpeechTokenizer (SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language M

Reviewer 02 · Rating 5 · marginally below the acceptance threshold · Confidence 3

Strengths

* The authors demonstrate the superiority of their proposed method over other discrete speech representation techniques in terms of the WER scores on both ASR and speech resynthesis tasks.
* The authors analyze the issue with the quality measure of semantic tokens based on their similarity to ground truth phonemes, while illustrating that the reconstruction loss of their proposed method exhibits a higher correlation.

Weaknesses

* Insufficient evaluation metrics. The research predominantly relies on WER as the principal evaluation metric for the performance of semantic speech tokens. To make a compelling case for the proposed method's superiority, it is essential to include other evaluation metrics such as speaker similarity, F0 error, or mean opinion score in the speech resynthesis experiments.
* Limited exploration of core downstream tasks. While semantic tokens are integral to token-based language modeling of spee

Reviewer 03 · Rating 5 · marginally below the acceptance threshold · Confidence 3

Strengths

1. RepCodec demonstrates promising results in both ASR and unit-to-speech resynthesis compared to the clustering method.
2. The discovery that PNMI can deviate from performance is intriguing.

Weaknesses

1. Overall, this paper lacks novelty: compared to SoundStream, it simply replaces the raw-waveform input with SSL representations.
2. Some parts of the details in this paper are confusing:
   * The difference in bar height in the encoder and decoder parts in Figure 1 is confusing because neither sampling nor dimension reduction is applied.
   * Equation (5) lacks sufficient explanation. I am unsure of its correctness as neither $\tilde{n}_k$ nor $\mathbf{e}_{i}$ is adequately

Code & Models

Repositories

mct10/repcodec
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: k-Means Clustering