On-the-fly Routing for Zero-shot MoE Speaker Adaptation of Speech Foundation Models for Dysarthric Speech Recognition

Shujie HU; Xurong Xie; Mengzhe Geng; Jiajun Deng; Huimeng Wang; Guinan Li; Chengxi Deng; Tianzi Wang; Mingyu Cui; Helen Meng; Xunying Liu

arXiv:2505.22072 · cs.SD · May 29, 2025

TL;DR

This paper introduces an on-the-fly, zero-shot MoE-based speaker adaptation method for dysarthric speech recognition. Relative to batch-mode adaptation, it reduces WER by up to 2.55% absolute and speeds up processing by up to 7 times.

Contribution

It presents a novel on-the-fly MoE framework that dynamically combines adapter experts conditioned on speech impairment severity and gender, adapting speech foundation models to unseen dysarthric speakers without speaker-specific adaptation data.

Findings

01

WER reduction of up to 1.34% absolute (6.36% relative) over the unadapted HuBERT/WavLM baselines

02

RTF speedup of up to 7 times over batch-mode adaptation

03

Lowest published WER of 16.35% on UASpeech
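
The absolute and relative WER reductions reported in the abstract are linked by simple arithmetic, as are the real-time factor (RTF) and the quoted speedup. The following is a minimal sketch of those relationships, not code from the paper. The function names are hypothetical, and the implied baseline WERs it computes are back-calculated from rounded figures, so they are approximate and do not appear in the source.

```python
def relative_reduction(baseline_wer, absolute_reduction):
    """Relative WER reduction (%) = absolute reduction / baseline WER * 100."""
    return 100.0 * absolute_reduction / baseline_wer


def implied_baseline(absolute_reduction, relative_reduction_pct):
    """Back out the baseline WER (%) from a reported absolute/relative pair."""
    return absolute_reduction / (relative_reduction_pct / 100.0)


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent processing / duration of the audio processed."""
    return processing_seconds / audio_seconds


def speedup(rtf_reference, rtf_new):
    """How many times faster the new system is than the reference."""
    return rtf_reference / rtf_new


if __name__ == "__main__":
    # Pairs quoted in the abstract: (absolute %, relative %).
    for label, abs_red, rel_red in [
        ("vs. unadapted baseline", 1.34, 6.36),
        ("vs. batch-mode adaptation", 2.55, 11.44),
    ]:
        base = implied_baseline(abs_red, rel_red)
        print(f"{label}: implied reference WER of about {base:.2f}%")
```

Under this definition of RTF, a 7 times speedup means the on-the-fly system's RTF is one seventh of the batch-mode system's RTF on the same audio.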

Abstract

This paper proposes a novel MoE-based speaker adaptation framework for foundation-model-based dysarthric speech recognition. This approach enables zero-shot adaptation and real-time processing while incorporating domain knowledge. Adapter experts conditioned on speech impairment severity and gender are dynamically combined using on-the-fly predicted speaker-dependent routing parameters. KL-divergence is used to further enforce diversity among experts and improve their generalization to unseen speakers. Experimental results on the UASpeech corpus suggest that on-the-fly MoE-based adaptation produces statistically significant WER reductions of up to 1.34% absolute (6.36% relative) over the unadapted baseline HuBERT/WavLM models. Consistent WER reductions of up to 2.55% absolute (11.44% relative) and RTF speedups of up to 7 times are obtained over batch-mode adaptation across varying speaker-level…
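
The routing idea in the abstract can be illustrated with a small, dependency-free sketch: a router maps a speaker-level feature to mixing weights, and those weights combine the outputs of several adapter experts. This is an illustration under stated assumptions, not the authors' implementation. The bottleneck adapter form, the softmax router, the expert count, all dimensions, and every function name are hypothetical. The `pairwise_diversity` term is one plausible way to measure expert diversity with KL-divergence; the abstract does not give the paper's actual formulation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def adapter(hidden, down, up):
    """Assumed bottleneck adapter: hidden + up(relu(down(hidden)))."""
    z = [max(0.0, v) for v in matvec(down, hidden)]
    return [h + d for h, d in zip(hidden, matvec(up, z))]


def route(speaker_feat, router_weights):
    """Predict mixing weights over experts from a speaker-level feature."""
    return softmax(matvec(router_weights, speaker_feat))


def expert_outputs(hidden, experts):
    """Run every expert (a (down, up) matrix pair) on the same hidden vector."""
    return [adapter(hidden, down, up) for down, up in experts]


def moe_adapter(hidden, experts, weights):
    """Combine expert outputs as a weighted sum using the routing weights."""
    outputs = expert_outputs(hidden, experts)
    return [sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(len(hidden))]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def pairwise_diversity(outputs):
    """Hypothetical diversity score: mean KL over ordered expert pairs,
    computed on softmax-normalised expert outputs."""
    dists = [softmax(o) for o in outputs]
    pairs = [(p, q) for i, p in enumerate(dists)
             for j, q in enumerate(dists) if i != j]
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)]
            for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    dim, bottleneck, feat_dim, n_experts = 8, 2, 4, 4  # all assumed sizes
    experts = [(random_matrix(bottleneck, dim, rng),
                random_matrix(dim, bottleneck, rng))
               for _ in range(n_experts)]
    router_weights = random_matrix(n_experts, feat_dim, rng)
    hidden = [rng.uniform(-1, 1) for _ in range(dim)]
    speaker_feat = [rng.uniform(-1, 1) for _ in range(feat_dim)]

    weights = route(speaker_feat, router_weights)
    adapted = moe_adapter(hidden, experts, weights)
    print("routing weights:", [round(w, 3) for w in weights])
    print("diversity:", pairwise_diversity(expert_outputs(hidden, experts)))
```

Because the routing weights come from a single forward pass over the speaker feature, no per-speaker parameter update is needed at test time, which is the property that distinguishes this style of adaptation from batch-mode adaptation.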

Taxonomy

Topics: Voice and Speech Disorders · Speech Recognition and Synthesis · Speech and Audio Processing