Estimating the Completeness of Discrete Speech Units

Sung-Lin Yeh; Hao Tang

arXiv:2409.06109 · eess.AS · October 28, 2024



TL;DR

This paper investigates the information content of discrete speech units, revealing that residuals contain significant phonetic information and that vector quantization does not effectively disentangle speaker and phonetic features.

Contribution

It provides an information-theoretic analysis of discretized speech representations, offering bounds on information completeness and assessing the role of residuals.

Findings

1. Speaker information is present in HuBERT discrete units.
2. Phonetic information is retained in the residual after quantization.
3. Vector quantization does not achieve effective disentanglement of features.

Abstract

Representing speech with discrete units has been widely used in speech codec and speech generation. However, there are several unverified claims about self-supervised discrete units, such as disentangling phonetic and speaker information with k-means, or assuming information loss after k-means. In this work, we take an information-theoretic perspective to answer how much information is present (information completeness) and how much information is accessible (information accessibility), before and after residual vector quantization. We show a lower bound for information completeness and estimate completeness on discretized HuBERT representations after residual vector quantization. We find that speaker information is sufficiently present in HuBERT discrete units, and that phonetic information is sufficiently present in the residual, showing that vector quantization does not achieve…
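To make the term "residual vector quantization" concrete, here is a minimal sketch of the general technique: each stage runs k-means on whatever the previous stages left unexplained, then subtracts the chosen centroid. This is not the authors' implementation. The function names (kmeans, residual_vq), the plain NumPy k-means, and the random features standing in for HuBERT representations are all assumptions made for illustration.

```python
import numpy as np


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means (illustrative helper, not from the paper)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        codes = dists.argmin(1)
        for j in range(k):
            members = x[codes == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(0)
    # Final assignment against the final centroids.
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, dists.argmin(1)


def residual_vq(x, k, num_stages):
    """Quantize x in stages; each stage quantizes the previous residual."""
    residual = x.copy()
    codebooks, all_codes = [], []
    for stage in range(num_stages):
        centroids, codes = kmeans(residual, k, seed=stage)
        residual = residual - centroids[codes]  # what this stage failed to capture
        codebooks.append(centroids)
        all_codes.append(codes)
    return codebooks, np.stack(all_codes, axis=1), residual


# Hypothetical stand-in for frame-level features (200 frames, 8 dimensions).
features = np.random.default_rng(0).normal(size=(200, 8))
codebooks, codes, residual = residual_vq(features, k=4, num_stages=3)

# The discrete units plus the final residual reconstruct the input exactly.
reconstruction = sum(cb[codes[:, i]] for i, cb in enumerate(codebooks))
```

The returned residual is the quantity the abstract refers to: whatever information the discrete codes did not capture is still carried by it, which is why it can be probed separately for phonetic or speaker content.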


Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques