When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition
Pehuén Moure, Niclas Pokel, Bilal Bounajma, Yingqiang Gao, Roman Boehringer, Longbiao Cheng, Shih-Chii Liu

TL;DR
This paper evaluates whether current audio-language models can use clinical context to improve dysarthric speech recognition. It finds that they largely cannot when the context is supplied only as a prompt, although fine-tuning with clinical prompts yields significant improvements for specific subgroups.
Contribution
The study introduces a benchmark for testing clinical context use in dysarthric speech recognition and demonstrates that fine-tuning with clinical prompts can substantially improve performance.
Findings
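Models do not significantly benefit from clinical context prompts at inference time; diagnosis-informed and clinically detailed prompts often degrade word error rate (WER).
Fine-tuning with LoRA reduces WER by 52% compared to baseline.
Significant gains are observed for speakers with Down syndrome and for mild-severity speakers.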
Abstract
Automatic speech recognition (ASR) systems remain brittle on dysarthric and other atypical speech. Recent audio-language models raise the possibility of improving performance by conditioning on additional clinical context at inference time, but it is unclear whether these models can make use of such information. We introduce a benchmark built on the Speech Accessibility Project (SAP) dataset that tests whether diagnosis labels, clinician-derived speech ratings, and progressively richer clinical descriptions improve transcription accuracy for dysarthric speech. Across matched comparisons on nine models, we find that current models do not meaningfully use this context: diagnosis-informed and clinically detailed prompts yield negligible improvements and often degrade word error rate. We complement the prompting analysis with context-dependent fine-tuning, showing that LoRA adaptation with…
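Word error rate is the metric behind both the prompting comparison and the reported 52% reduction, so a small worked example may help. The sketch below is a minimal illustration, not the authors' evaluation code: it computes WER as the word-level edit distance divided by the number of reference words, and expresses a change between two systems as a relative reduction. The function names, the plain whitespace tokenization, and all numbers in the usage lines are assumptions chosen for illustration; none of them come from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (rolling-row Levenshtein distance).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Fraction by which adapted_wer is lower than baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer


# Illustrative values only, not results from the paper.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
example_reduction = relative_wer_reduction(0.25, 0.12)
```

Under this definition, a relative reduction of 0.52 means the adapted system's WER is 48% of the baseline's, which is different from a drop of 52 percentage points.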
