Zero-Shot Recognition of Dysarthric Speech Using Commercial Automatic Speech Recognition and Multimodal Large Language Models

Ali Alsayegh; Tariq Masood

arXiv:2512.17474 · eess.AS · December 22, 2025

Open Access

TL;DR

This study evaluates commercial speech recognition and multimodal large language models on dysarthric speech, revealing severity-dependent performance gaps and potential for semantic preservation despite high error rates.

Contribution

The study provides the first empirical baseline comparison of eight commercial ASR and MLLM systems on dysarthric speech, highlighting their strengths and limitations.

Findings

01

On mild dysarthria, systems achieve 3-5% WER, close to their performance on typical speech.

02

Severe dysarthria results in over 49% WER across all systems.

03

GPT-4o reduces WER by 7.36 percentage points with consistent improvements.
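
The WER figures above are word error rates: the number of word-level substitutions, deletions, and insertions needed to turn a system's transcript into the reference transcript, divided by the number of reference words. The sketch below illustrates that standard definition with a word-level edit distance. It is not the paper's evaluation code; the function name `word_error_rate` and the text normalisation (lowercasing and whitespace splitting) are assumptions made for illustration, and the study's actual normalisation may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference words.

    Assumption: normalisation is just lowercasing and whitespace splitting;
    a real evaluation pipeline would usually also handle punctuation.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference. A change quoted in percentage points, as in finding 03, is an absolute difference between two WER values, not a relative one.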

Abstract

Voice-based human-machine interaction is a primary modality for accessing intelligent systems, yet individuals with dysarthria face systematic exclusion due to recognition performance gaps. Whilst automatic speech recognition (ASR) achieves word error rates (WER) below 5% on typical speech, performance degrades dramatically for dysarthric speakers. Multimodal large language models (MLLMs) offer potential for leveraging contextual reasoning to compensate for acoustic degradation, yet their zero-shot capabilities remain uncharacterised. This study evaluates eight commercial speech-to-text services on the TORGO dysarthric speech corpus: four conventional ASR systems (AssemblyAI, Whisper large-v3, Deepgram Nova-3, Nova-3 Medical) and four MLLM-based systems (GPT-4o, GPT-4o Mini, Gemini 2.5 Pro, Gemini 2.5 Flash). Evaluation encompasses lexical accuracy, semantic preservation, and…

Taxonomy

Topics: Voice and Speech Disorders · Speech Recognition and Synthesis · Phonocardiography and Auscultation Techniques