Responsible Benchmarking of Fairness for Automatic Speech Recognition
Felix Herron, Ange Richard, François Portet, Alexandre Allauzen, Solange Rossato

TL;DR
This paper proposes best practices for benchmarking fairness in automatic speech recognition, emphasizing precise hypotheses, tailored metrics, and intersectional analysis to improve reliability and avoid misinterpretation.
Contribution
It introduces a framework for more accurate fairness benchmarking in ASR by integrating interdisciplinary insights and advocating for detailed intersectional analysis.
Findings
Single demographic group evaluations can misidentify fairness issues.
Intersectional analysis yields more accurate fairness assessments.
Best practices improve reliability of fairness benchmarking in ASR.
Abstract
Many studies have shown that automatic speech recognition (ASR) systems perform unequally across speaker groups (SGs). However, the way these studies reach this conclusion is inconsistent. To pave the way for more reliable results in future studies, we lay out best practices for benchmarking ASR fairness, drawing on literature from machine learning fairness, the social sciences, and speech science. We first describe the importance of precisely defining the fairness hypothesis being interrogated and of tailoring fairness metrics to apply specifically to that hypothesis. We then examine several benchmarks used to rate ASR systems on fairness and discuss how their results can be misconstrued without careful attention to the intersections between SGs. We find that evaluating fairness based on single heterogeneous SGs, such as they are defined in fairness benchmarks, can lead to…
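To illustrate why single-attribute speaker groups can hide disparities, here is a minimal sketch on hypothetical, synthetic data; it is not the authors' benchmark, data, or metric. It assumes each record already carries a word-error count (substitutions + deletions + insertions) and a reference word count, computes pooled WER per group, and compares an absolute gap with a relative ratio. Which of the two is appropriate depends on the fairness hypothesis being tested. The names `pooled_wer`, `max_gap`, and `max_ratio` are illustrative only.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical, synthetic records: (gender, age_group, word_errors, reference_words).
# word_errors is assumed to be the S + D + I count from an ASR alignment.
RECORDS: List[Tuple[str, str, int, int]] = [
    ("female", "young", 5, 100),
    ("female", "older", 15, 100),
    ("male", "young", 15, 100),
    ("male", "older", 5, 100),
]

GroupKey = Callable[[str, str], tuple]


def pooled_wer(records: List[Tuple[str, str, int, int]], key: GroupKey) -> Dict[tuple, float]:
    """Pooled WER per group: total errors divided by total reference words."""
    errors: Dict[tuple, int] = defaultdict(int)
    words: Dict[tuple, int] = defaultdict(int)
    for gender, age, err, n_words in records:
        group = key(gender, age)
        errors[group] += err
        words[group] += n_words
    return {group: errors[group] / words[group] for group in errors}


def max_gap(wer_by_group: Dict[tuple, float]) -> float:
    """Absolute WER difference between the worst- and best-served groups."""
    values = list(wer_by_group.values())
    return max(values) - min(values)


def max_ratio(wer_by_group: Dict[tuple, float]) -> float:
    """Relative WER: worst-served group divided by best-served group."""
    values = list(wer_by_group.values())
    return max(values) / min(values)


# Single-attribute groupings versus the intersection of both attributes.
by_gender = pooled_wer(RECORDS, lambda g, a: (g,))
by_age = pooled_wer(RECORDS, lambda g, a: (a,))
by_intersection = pooled_wer(RECORDS, lambda g, a: (g, a))

if __name__ == "__main__":
    print("gender gap:", round(max_gap(by_gender), 3))
    print("age gap:", round(max_gap(by_age), 3))
    print("intersectional gap:", round(max_gap(by_intersection), 3))
    print("intersectional ratio:", round(max_ratio(by_intersection), 3))
```

In this toy setup both single-attribute comparisons show no gap, because each gender and each age group pools one well-served and one poorly served subgroup, while the intersectional comparison exposes a 10-point absolute gap (a 3x ratio).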
