Advanced Modeling of Interlanguage Speech Intelligibility Benefit with L1-L2 Multi-Task Learning Using Differentiable K-Means for Accent-Robust Discrete Token-Based ASR

Kentaro Onda; Satoru Fukayama; Daisuke Saito; Nobuaki Minematsu

arXiv:2601.19767·cs.SD·January 28, 2026

Open Access

TL;DR

This paper makes discrete token-based ASR more robust to foreign-accented speech by modeling the interlanguage speech intelligibility benefit (ISIB) with differentiable k-means, trained in a multi-task framework on both L1 and L2 ASR. It reports an approximately 20% relative improvement in recognition accuracy.

Contribution

Building on a prior method that learned k-means cluster centroids from the speaker's L1, the paper uses differentiable k-means and optimizes the entire module for both L1 and L2 ASR, outperforming the baselines.

Findings

01 Achieved approximately 20% relative improvement in recognition accuracy.

02 Outperformed the baselines both when using only native speech and when additionally incorporating a limited amount of accented speech.

03 Demonstrated the effectiveness of differentiable k-means for modeling ISIB.

Abstract

Building ASR systems that are robust to foreign-accented speech is an important challenge in today's globalized world. A prior study explored how to improve the performance of phonetic token-based ASR on accented speech by reproducing the phenomenon known as the interlanguage speech intelligibility benefit (ISIB), in which foreign-accented speech is more intelligible to listeners who share the speaker's native language than to native listeners. ISIB was implemented by using the speaker's L1 to learn k-means cluster centroids in an SSL feature space, from which phonetic tokens were obtained. In this study, we propose a more advanced modeling of ISIB. By employing differentiable k-means and optimizing the entire module for both L1 and L2 ASR, the proposed method outperformed the baselines, both when using only native speech and when additionally incorporating a limited amount of accented speech. Notably,…
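
The contrast between ordinary k-means tokenization and a differentiable variant can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give their formulation, so the softmax relaxation, the `temperature` parameter, the function names, and the random arrays standing in for SSL features and centroids are all assumptions made for illustration.

```python
import numpy as np

def squared_distances(features, centroids):
    """Squared Euclidean distance from every frame to every centroid, shape (T, K)."""
    diff = features[:, None, :] - centroids[None, :, :]
    return (diff ** 2).sum(axis=-1)

def hard_tokens(features, centroids):
    """Ordinary k-means tokenization: index of the nearest centroid per frame."""
    return squared_distances(features, centroids).argmin(axis=1)

def soft_assignments(features, centroids, temperature=1.0):
    """Assumed relaxation: softmax over negative squared distances.

    Unlike argmin, this is smooth in the centroids, so a downstream
    ASR loss could in principle pass gradients back to them.
    """
    logits = -squared_distances(features, centroids) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

# Hypothetical stand-ins: 5 frames of 4-dim "SSL features", 3 centroids.
rng = np.random.default_rng(0)
features = rng.normal(size=(5, 4))
centroids = rng.normal(size=(3, 4))

tokens = hard_tokens(features, centroids)
probs = soft_assignments(features, centroids, temperature=0.5)
```

Each row of `probs` is a distribution over centroids whose largest entry falls on the same centroid that `hard_tokens` selects. Lowering `temperature` pushes the rows toward one-hot vectors, which is how a relaxation of this kind approaches discrete tokens.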

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Voice and Speech Disorders