WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue

Zachary Ellis; Jared Joselowitz; Yash Deo; Yajie He; Anna Kalygina; Aisling Higham; Mana Rahimzadeh; Yan Jia; Ibrahim Habli; Ernest Lim

arXiv:2511.16544 · cs.CL · January 21, 2026

Open Access

TL;DR

This paper demonstrates that traditional ASR evaluation metrics like WER poorly predict clinical impact, and introduces an LLM-based assessment tool that aligns closely with expert clinician judgments for safer clinical dialogue applications.

Contribution

The paper shows that WER is inadequate for assessing the clinical impact of transcription errors, and it develops an automated LLM-based evaluation framework designed to replicate expert clinician judgments.

Findings

01  WER correlates poorly with clinical impact labels.

02  The LLM judge achieves 90% accuracy and high agreement with clinicians.

03  Proposes a scalable, automated safety assessment method for clinical ASR.
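To make the "accuracy and agreement" finding concrete, here is a minimal sketch of how a judge's labels could be scored against clinician labels over the three impact classes. Cohen's kappa is used here as one common chance-corrected agreement statistic; this summary does not say which agreement measure the paper reports, so treat that choice as an assumption. The label lists are made-up toy values, not data from the paper, and the function names are hypothetical.

```python
from collections import Counter

# The three impact classes named in the abstract.
LABELS = ("No", "Minimal", "Significant")


def accuracy(gold, pred):
    """Fraction of items on which the two label sequences agree."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("label sequences must be non-empty and equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def cohens_kappa(gold, pred):
    """Chance-corrected agreement between two raters (Cohen's kappa)."""
    n = len(gold)
    p_observed = accuracy(gold, pred)
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    # Expected agreement if both raters labeled independently
    # according to their own marginal label frequencies.
    p_expected = sum(gold_counts[l] * pred_counts[l] for l in LABELS) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy, hypothetical labels for six utterance pairs (NOT from the paper).
clinician = ["No", "No", "Minimal", "Significant", "Significant", "No"]
judge = ["No", "No", "Minimal", "Significant", "Minimal", "No"]

print(f"accuracy = {accuracy(clinician, judge):.3f}")
print(f"kappa    = {cohens_kappa(clinician, judge):.3f}")
```

Reporting a chance-corrected statistic alongside raw accuracy matters when one class (for example "No" impact) dominates, because a judge that always predicts the majority class can reach high accuracy with no real agreement.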

Abstract

As Automatic Speech Recognition (ASR) is increasingly deployed in clinical dialogue, standard evaluations still rely heavily on Word Error Rate (WER). This paper challenges that standard, investigating whether WER or other common metrics correlate with the clinical impact of transcription errors. We establish a gold-standard benchmark by having expert clinicians compare ground-truth utterances to their ASR-generated counterparts, labeling the clinical impact of any discrepancies found in two distinct doctor-patient dialogue datasets. Our analysis reveals that WER and a comprehensive suite of existing metrics correlate poorly with the clinician-assigned risk labels (No, Minimal, or Significant Impact). To bridge this evaluation gap, we introduce an LLM-as-a-Judge, programmatically optimized using GEPA through DSPy to replicate expert clinical assessment. The optimized judge…
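As an illustration of why a word-level metric can miss clinical impact, the sketch below computes WER in its standard form (word-level edit distance divided by the number of reference words) for two ASR outputs of the same utterance. The example sentences are invented for illustration and do not come from the paper's datasets; the view that one error is more clinically serious than the other is likewise an assumption made here, not a clinician label.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words (row-by-row Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical utterances (not from the paper).
reference = "i take my tablets twice a day"
benign = "i take my tablet twice a day"    # plural dropped
harmful = "i take my tablets once a day"   # dose frequency changed

wer_benign = word_error_rate(reference, benign)
wer_harmful = word_error_rate(reference, harmful)
print(f"benign  WER = {wer_benign:.3f}")
print(f"harmful WER = {wer_harmful:.3f}")
```

Both hypotheses differ from the reference by a single substituted word, so they receive the same WER even though only the second changes the stated dosing frequency. WER counts word edits and carries no notion of which words matter clinically.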

Taxonomy

Topics: Speech and dialogue systems · Topic Modeling · Neurobiology of Language and Bilingualism