MOPSA: Mixture of Prompt-Experts Based Speaker Adaptation for Elderly Speech Recognition

Chengxi Deng; Xurong Xie; Shujie Hu; Mengzhe Geng; Yicong Jiang; Jiankun Zhao; Jiajun Deng; Guinan Li; Youjun Chen; Huimeng Wang; Haoning Xu; Mingyu Cui; Xunying Liu

arXiv:2505.24224 · eess.AS · June 2, 2025 · Interspeech

TL;DR

This paper introduces MOPSA, a speaker adaptation method for elderly speech recognition that uses a mixture of prompt-experts to adapt to unseen speakers zero-shot and in real time. It reduces WER/CER relative to the speaker-independent model and achieves up to a 16.12x speed-up over offline adaptation.

Contribution

MOPSA leverages a mixture of prompt-experts and a dynamic router network for effective zero-shot, real-time elderly speaker adaptation in speech recognition.
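The combination step can be illustrated with a minimal sketch. It assumes a linear router followed by softmax gating over a fixed bank of prompt vectors; the function names, the shapes, and the utterance-level feature vector are hypothetical and are not taken from the paper, which may implement the router differently.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_prompts(utt_feature, router_weights, prompt_experts):
    """Combine prompt-experts using router-produced gate weights.

    utt_feature:    length-D summary of the incoming utterance (assumed input)
    router_weights: K rows of length D (assumed linear router, no bias)
    prompt_experts: K prompt vectors of length P (e.g. cluster centroids)
    Returns (gates, combined_prompt).
    """
    # One score per expert: dot product of the router row with the feature.
    scores = [sum(w * x for w, x in zip(row, utt_feature))
              for row in router_weights]
    gates = softmax(scores)
    # The gate-weighted sum of expert prompts gives the adapted prompt.
    dim = len(prompt_experts[0])
    combined = [sum(g * expert[j] for g, expert in zip(gates, prompt_experts))
                for j in range(dim)]
    return gates, combined


# Toy usage with random values standing in for trained parameters.
random.seed(0)
K, D, P = 4, 8, 6
experts = [[random.gauss(0, 1) for _ in range(P)] for _ in range(K)]
router = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]
feature = [random.gauss(0, 1) for _ in range(D)]
gates, prompt = route_prompts(feature, router, experts)
```

Because the gates depend only on the incoming utterance, no per-speaker fine-tuning pass is needed at test time, which is consistent with the zero-shot, real-time behaviour claimed above.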

Findings

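01. Outperforms the speaker-independent (SI) model with statistically significant WER/CER reductions of 0.86% and 1.47% absolute (4.21% and 5.40% relative)

02. Achieves up to a 16.12x speed-up over offline adaptation

03. Effective on both the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets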

Abstract

This paper proposes a novel Mixture of Prompt-Experts based Speaker Adaptation approach (MOPSA) for elderly speech recognition. It allows zero-shot, real-time adaptation to unseen speakers, and leverages domain knowledge tailored to elderly speakers. The top-K most distinctive speaker prompt clusters derived using K-means serve as experts. A router network is trained to dynamically combine the clustered prompt-experts. Acoustic- and language-level variability among elderly speakers is modelled using separate encoder and decoder prompts for Whisper. Experiments on the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets suggest that online MOPSA adaptation outperforms the speaker-independent (SI) model by statistically significant word error rate (WER) or character error rate (CER) reductions of 0.86% and 1.47% absolute (4.21% and 5.40% relative). Real-time factor (RTF)…
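As a sanity check on the reported figures, absolute and relative reductions are linked through the baseline error rate. The arithmetic below back-solves the implied SI baselines, assuming the two reductions are listed in the same order as the datasets (English WER first, Cantonese CER second). The baseline values are derived here rather than quoted from the paper, and they are approximate because the published percentages are rounded.

```latex
\[
\text{relative reduction} = \frac{\Delta_{\text{abs}}}{E_{\text{SI}}}
\quad\Longrightarrow\quad
E_{\text{SI}} = \frac{\Delta_{\text{abs}}}{\text{relative reduction}}
\]
\[
E_{\text{SI}}^{\text{WER}} \approx \frac{0.86\%}{0.0421} \approx 20.4\%,
\qquad
E_{\text{SI}}^{\text{CER}} \approx \frac{1.47\%}{0.0540} \approx 27.2\%
\]
```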

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems