SITA: Learning Speaker-Invariant and Tone-Aware Speech Representations for Low-Resource Tonal Languages

Tianyi Xu; Xuan Ouyang; Binwei Yao; Shoua Xiong; Sara Misurelli; Maichou Lor; Junjie Hu

arXiv:2601.09050·cs.CL·January 15, 2026

Open Access

TL;DR

SITA is a training method that makes speech representations for low-resource tonal languages speaker-invariant and tone-aware, improving cross-gender lexical retrieval while keeping recognition accuracy comparable to baseline models.

Contribution

SITA introduces a staged multi-objective training approach that adapts pretrained speech encoders to be tone-aware and speaker-invariant in low-resource tonal languages.

Findings

01. Improves cross-gender lexical retrieval accuracy in Hmong.

02. Maintains ASR accuracy comparable to baseline models.

03. Demonstrates effectiveness on Mandarin, suggesting the approach applies beyond Hmong.

Abstract

Tonal low-resource languages are widely spoken yet remain underserved by modern speech technology. A key challenge is learning representations that are robust to nuisance variation such as gender while remaining tone-aware for different lexical meanings. To address this, we propose SITA, a lightweight adaptation recipe that enforces Speaker-Invariance and Tone-Awareness for pretrained wav2vec-style encoders. SITA uses staged multi-objective training: (i) a cross-gender contrastive objective encourages lexical consistency across speakers, while a tone-repulsive loss prevents tone collapse by explicitly separating same-word different-tone realizations; and (ii) an auxiliary Connectionist Temporal Classification (CTC)-based ASR objective with distillation stabilizes recognition-relevant structure. We evaluate primarily on Hmong, a highly tonal and severely under-resourced language where…
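
To make stage (i) of the abstract concrete, the following is a minimal sketch of how a cross-gender contrastive term and a tone-repulsive term might be combined. It is not the authors' implementation: the abstract does not give the loss formulas, so the InfoNCE form, the hinge form, and every name and default here (`cross_gender_contrastive`, `tone_repulsive`, `stage_one_loss`, `temperature`, `margin`, `lam`) are assumptions chosen for illustration. The embeddings are assumed to be pooled word-level vectors from the encoder. The CTC and distillation objective of stage (ii) is not shown.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def cross_gender_contrastive(anchors, positives, temperature=0.1):
    """InfoNCE-style loss (assumed form, not taken from the paper).

    anchors[i] and positives[i] are embeddings of the same word and tone
    spoken by speakers of different genders; all other rows of `positives`
    act as in-batch negatives.
    """
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = a @ p.T / temperature
    # Subtract the row maximum for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The matching pair sits on the diagonal.
    return -np.mean(np.diag(log_probs))


def tone_repulsive(anchors, tone_negatives, margin=0.5):
    """Hinge penalty (assumed form, not taken from the paper).

    tone_negatives[i] is the same word as anchors[i] realized with a
    different tone; the pair is penalized when its cosine similarity
    exceeds `margin`.
    """
    a = l2_normalize(anchors)
    n = l2_normalize(tone_negatives)
    cos = np.sum(a * n, axis=1)
    return np.mean(np.maximum(0.0, cos - margin))


def stage_one_loss(anchors, positives, tone_negatives, lam=1.0):
    """Weighted sum of the two terms; `lam` is a hypothetical weight."""
    return (cross_gender_contrastive(anchors, positives)
            + lam * tone_repulsive(anchors, tone_negatives))
```

In this sketch the contrastive term pulls cross-gender realizations of the same word together, while the hinge term only activates when a different-tone realization drifts too close, which is one plausible way to discourage the tone collapse the abstract describes.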

Taxonomy

Topics: Phonetics and Phonology Research · Speech Recognition and Synthesis · Speech and Audio Processing