Lexical Tone is Hard to Quantize: Probing Discrete Speech Units in Mandarin and Yorùbá
Opeyemi Osakuade, Simon King

TL;DR
This paper investigates the limitations of discrete speech units derived from self-supervised models in encoding lexical tone and prosody, highlighting the need for tone-aware quantisation methods.
Contribution
It demonstrates that current DSU quantisation methods inadequately encode suprasegmental features like tone, suggesting new tone-aware techniques are necessary.
Findings
SSL representations encode tone, but DSUs prioritise phonetic structure
Quantisation methods, including K-means, struggle with tone encoding
A proposed residual clustering approach improves tone encoding
Abstract
Discrete speech units (DSUs) are derived by quantising representations from models trained using self-supervised learning (SSL). They are a popular representation for a wide variety of spoken language tasks, including those where prosody matters. DSUs are especially convenient for tasks where text and speech are jointly modelled, such as text-to-speech and multimodal dialogue systems. But we have found that DSUs encode suprasegmental information less reliably than segmental structure, which we demonstrate in this work using lexical tone, though this limitation likely extends to other suprasegmental features such as prosody. Our investigations using the tone languages Mandarin and Yorùbá show that the SSL latent representations themselves do encode tone, yet DSUs obtained using quantisation tend to prioritise phonetic structure, which makes lexical tone less reliably encoded. This…
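The quantisation step the abstract describes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: random vectors stand in for SSL frame features, the helper names `kmeans` and `residual_quantise` are hypothetical, and the two-stage scheme is a generic residual vector quantisation that may differ from the residual clustering approach the paper proposes.

```python
import numpy as np


def kmeans(x, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means. Returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct frames.
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every frame to every centroid.
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # leave empty clusters where they are
                centroids[j] = members.mean(0)
    # Final assignment against the updated centroids.
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, dist.argmin(1)


def residual_quantise(x, k1, k2):
    """Two-stage quantisation (assumed scheme, for illustration only).

    Stage 1 clusters the frames themselves; each label is one DSU.
    Stage 2 clusters what stage 1 discarded (frame minus its centroid),
    giving a second unit stream that can carry leftover information.
    """
    c1, u1 = kmeans(x, k1, seed=0)
    residual = x - c1[u1]
    c2, u2 = kmeans(residual, k2, seed=1)
    return (c1, u1), (c2, u2)


# Stand-in for SSL features: 500 frames of dimension 16 (synthetic data).
frames = np.random.default_rng(42).normal(size=(500, 16))
(c1, units), (c2, residual_units) = residual_quantise(frames, k1=8, k2=4)

# Reconstruction error with one stage versus two stages.
err_one = ((frames - c1[units]) ** 2).sum()
err_two = ((frames - c1[units] - c2[residual_units]) ** 2).sum()
```

Here `units` plays the role of the DSU sequence, and `residual_units` is the second stream. Whether that second stream actually captures tone depends on the real features and on the paper's specific method, neither of which this sketch reproduces.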
