Lexical Tone is Hard to Quantize: Probing Discrete Speech Units in Mandarin and Yorùbá

Opeyemi Osakuade; Simon King

arXiv:2604.07467·cs.CL·April 10, 2026

TL;DR

This paper investigates the limitations of discrete speech units derived from self-supervised models in encoding lexical tone and prosody, highlighting the need for tone-aware quantisation methods.

Contribution

The paper demonstrates that current DSU quantisation methods encode suprasegmental features such as lexical tone inadequately, and suggests that new tone-aware techniques are necessary.

Findings

01 SSL representations encode tone, but DSUs prioritise phonetic structure

02 Quantisation methods, including K-means, struggle with tone encoding

03 A proposed residual clustering approach improves tone encoding

Abstract

Discrete speech units (DSUs) are derived by quantising representations from models trained using self-supervised learning (SSL). They are a popular representation for a wide variety of spoken language tasks, including those where prosody matters. DSUs are especially convenient for tasks where text and speech are jointly modelled, such as text-to-speech and multimodal dialogue systems. But we have found that DSUs encode suprasegmental information less reliably than segmental structure, which we demonstrate in this work using lexical tone, though this limitation likely extends to other suprasegmental features such as prosody. Our investigations using the tone languages Mandarin and Yorùbá show that the SSL latent representations themselves do encode tone, yet DSUs obtained using quantisation tend to prioritise phonetic structure, which makes lexical tone less reliably encoded. This…
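
To make the quantisation issue concrete, here is a minimal toy sketch in pure Python. It is not the authors' method or data. It assumes synthetic 2-D "frames" in which a hypothetical phone dimension has much larger spread than a hypothetical tone dimension, clusters them with plain K-means to get unit IDs, and then clusters the residuals (each frame minus its assigned centroid) as one simple reading of "residual clustering". All names (`kmeans`, `purity`, `frames`) and all numeric values are illustrative assumptions.

```python
import random
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=10):
    """Lloyd's K-means with deterministic farthest-point initialisation."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next seed: the point farthest from all centroids chosen so far.
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: mean of members (an empty cluster keeps its centroid).
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

def purity(units, labels):
    """Fraction of frames whose label is the majority label of their unit."""
    by_unit = {}
    for u, l in zip(units, labels):
        by_unit.setdefault(u, Counter())[l] += 1
    return sum(max(c.values()) for c in by_unit.values()) / len(labels)

# Synthetic frames (assumption): phones differ by 10 units along axis 0,
# tones differ by only 1 unit along axis 1.
rng = random.Random(0)
frames, phones, tones = [], [], []
for phone, px in enumerate((-5.0, 5.0)):
    for tone, ty in enumerate((-0.5, 0.5)):
        for _ in range(50):
            frames.append((px + rng.gauss(0, 0.05), ty + rng.gauss(0, 0.05)))
            phones.append(phone)
            tones.append(tone)

# Stage 1: ordinary K-means units follow the high-variance phone axis.
centroids, units = kmeans(frames, k=2)

# Stage 2: cluster what stage 1 left unexplained (frame minus its centroid).
residuals = [tuple(x - c for x, c in zip(f, centroids[u]))
             for f, u in zip(frames, units)]
_, res_units = kmeans(residuals, k=2)

print("stage-1 units vs phone:", purity(units, phones))
print("stage-1 units vs tone: ", purity(units, tones))
print("residual units vs tone:", purity(res_units, tones))
```

In this toy setting the first-stage units separate the two phones but sit at chance for tone, while the residual units recover tone. Real SSL features are high-dimensional and far less cleanly separated, so this only illustrates the intuition behind the findings, not their magnitude.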
