Beyond Word Error Rate: Auditing the Diversity Tax in Speech Recognition through Dataset Cartography

Ting-Hui Cheng; Line H. Clemmensen; Sneha Das

arXiv:2603.05267 · cs.LG · March 6, 2026

Open Access

TL;DR

This paper critiques the reliance on Word Error Rate in speech recognition, introducing new metrics and a framework to better identify biases and disparities affecting marginalized speakers.

Contribution

It introduces the sample difficulty index (SDI) and demonstrates how non-linear and semantic metrics reveal systemic biases overlooked by WER.

Findings

01. SDI correlates with demographic and acoustic factors influencing errors.

02. Metrics EmbER and SemDist uncover biases WER misses.

03. Proposes a framework for auditing ASR systems for disparities.

Abstract

Automatic speech recognition (ASR) systems are predominantly evaluated using the Word Error Rate (WER). However, raw token-level metrics fail to capture semantic fidelity and routinely obscure the "diversity tax", the disproportionate burden on marginalized and atypical speakers due to systematic recognition failures. In this paper, we explore the limitations of relying solely on lexical counts by systematically evaluating a broader class of non-linear and semantic metrics. To enable rigorous model auditing, we introduce the sample difficulty index (SDI), a novel metric that quantifies how intrinsic demographic and acoustic factors drive model failure. By mapping SDI onto data cartography, we demonstrate that the metrics EmbER and SemDist expose hidden systemic biases and inter-model disagreements that WER ignores. Finally, our findings are the first steps towards a robust audit framework for…
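
To illustrate the limitation the abstract describes, WER can be sketched as the word-level edit distance between reference and hypothesis, divided by the number of reference words. This is the standard textbook definition, not code from the paper; the function name `word_error_rate` and the example sentences are assumptions chosen for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example sentences (not from the paper).
reference = "the patient has no fever"
meaning_flipped = "the patient has a fever"      # one substitution, meaning reversed
meaning_preserved = "the patient has no fevers"  # one substitution, meaning intact

wer_flipped = word_error_rate(reference, meaning_flipped)
wer_preserved = word_error_rate(reference, meaning_preserved)
```

Both hypotheses differ from the reference by a single substituted word, so they receive the same WER even though only one of them changes the meaning. This is the kind of gap that semantic metrics are intended to expose.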

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Machine Learning and Data Classification