How Far Do SSL Speech Models Listen for Tone? Temporal Focus of Tone Representation under Low-resource Transfer

Minu Kim; Ji Sub Um; Hoirin Kim

arXiv:2511.12285 · eess.AS · January 27, 2026

Open Access

TL;DR

This paper investigates how self-supervised speech models capture lexical tone in four languages (Burmese, Thai, Lao, and Vietnamese) and finds that the temporal focus of tone representations after transfer depends on the downstream fine-tuning task.

Contribution

The paper analyzes tone representation in SSL speech models across four tone languages beyond Mandarin and shows how the downstream task type shapes temporal focus in tone transfer.

Findings

01

Tone cues span approximately 100 ms in Burmese and Thai and approximately 180 ms in Lao and Vietnamese.

02

Tone transfer varies with the downstream task: automatic speech recognition fine-tuning aligns temporal spans with language-specific tone cues.

03

Fine-tuning on prosody- and voice-related tasks biases models toward overly long temporal spans.

Abstract

Lexical tone is central to many languages but remains underexplored in self-supervised learning (SSL) speech models, especially beyond Mandarin. We study four languages with complex and diverse tone systems (Burmese, Thai, Lao, and Vietnamese) to ask how far such models "listen" for tone and how transfer operates in low-resource conditions. As a baseline reference, we estimate the temporal span of tone cues: approximately 100ms (Burmese/Thai) and 180ms (Lao/Vietnamese). Probes and gradient analysis on fine-tuned SSL models reveal that tone transfer varies by downstream task: automatic speech recognition fine-tuning aligns spans with language-specific tone cues, while prosody- and voice-related tasks bias toward overly long spans. These findings indicate that tone transfer is shaped by downstream task, highlighting task effects on temporal focus in tone modeling.
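To make the idea of a "temporal span" concrete, the following is a minimal sketch of one way a span could be read off a per-frame gradient (saliency) profile. It is not the authors' procedure: the function name `temporal_span_ms`, the 20 ms frame stride, and the 90% mass threshold are all illustrative assumptions, and the input values are made up for the example.

```python
from typing import Sequence


def temporal_span_ms(
    saliency: Sequence[float],
    frame_ms: float = 20.0,  # assumed frame stride; not taken from the paper
    mass: float = 0.9,       # assumed fraction of saliency to cover, in (0, 1]
) -> float:
    """Width in ms of the shortest contiguous frame window that holds
    at least `mass` of the total absolute saliency."""
    values = [abs(v) for v in saliency]
    total = sum(values)
    if total == 0:
        return 0.0

    target = mass * total
    best = len(values)  # worst case: the whole sequence
    left = 0
    running = 0.0
    for right, v in enumerate(values):
        running += v
        # Drop frames from the left while the window still holds enough mass.
        while running - values[left] >= target:
            running -= values[left]
            left += 1
        if running >= target:
            best = min(best, right - left + 1)
    return best * frame_ms


# Hypothetical saliency profile peaking around the probed frame.
profile = [0, 0, 1, 4, 6, 4, 1, 0, 0, 0]
print(temporal_span_ms(profile))            # 90% of the mass
print(temporal_span_ms(profile, mass=1.0))  # all of the mass
```

Under these assumptions, a profile concentrated in a few frames yields a short span, while a fine-tuned model that spreads gradient over many frames yields a longer one, which is the kind of contrast the abstract describes between ASR and prosody- or voice-related fine-tuning.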

Taxonomy

Topics: Phonetics and Phonology Research · Speech Recognition and Synthesis · Neurobiology of Language and Bilingualism