When Fairness Metrics Disagree: Evaluating the Reliability of Demographic Fairness Assessment in Machine Learning
Khalid Adnan Alsayed

TL;DR
This paper shows that different fairness metrics often produce conflicting assessments of bias when applied to the same machine learning model, so conclusions drawn from any single metric should be treated with caution.
Contribution
The paper systematically analyzes inconsistency among multiple fairness metrics in face recognition and introduces the Fairness Disagreement Index to quantify that inconsistency.
Findings
Fairness assessments vary significantly across metrics.
Disagreement persists across thresholds and models.
Single-metric evaluation is unreliable for bias detection.
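
As a rough illustration of how two metrics can disagree on the same system, the sketch below compares per-group false match rate (FMR) and false non-match rate (FNMR) for a toy face verification setup. Everything in it is an assumption made for illustration: the scores are invented, the group names and threshold are hypothetical, and FMR and FNMR are simply two common verification error rates, not necessarily the metrics used in the paper. It does not implement the Fairness Disagreement Index, whose definition is not given in this summary.

```python
# Toy data, invented for illustration only.
# Each entry is (similarity score, whether the pair is a genuine match).
pairs = {
    "group_a": [(0.9, True), (0.8, True), (0.4, True), (0.3, True),
                (0.2, False), (0.1, False), (0.3, False), (0.2, False)],
    "group_b": [(0.9, True), (0.8, True), (0.7, True), (0.6, True),
                (0.7, False), (0.6, False), (0.2, False), (0.1, False)],
}


def error_rates(samples, threshold):
    """Return (false match rate, false non-match rate) at a threshold."""
    genuine = [score for score, is_genuine in samples if is_genuine]
    impostor = [score for score, is_genuine in samples if not is_genuine]
    # Impostor pairs accepted as matches.
    fmr = sum(score >= threshold for score in impostor) / len(impostor)
    # Genuine pairs rejected as non-matches.
    fnmr = sum(score < threshold for score in genuine) / len(genuine)
    return fmr, fnmr


def worst_group(rates):
    """Name of the group with the highest error rate."""
    return max(rates, key=rates.get)


threshold = 0.5  # hypothetical operating point
rates = {group: error_rates(samples, threshold) for group, samples in pairs.items()}
fmr = {group: r[0] for group, r in rates.items()}
fnmr = {group: r[1] for group, r in rates.items()}

print("FMR per group: ", fmr)
print("FNMR per group:", fnmr)
print("Worst group by FMR: ", worst_group(fmr))
print("Worst group by FNMR:", worst_group(fnmr))
```

In this toy data each metric singles out a different group as the most disadvantaged, so a report built on only one of them would reach the opposite conclusion from a report built on the other.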
Abstract
The evaluation of fairness in machine learning systems has become a central concern in high-stakes applications, including biometric recognition, healthcare decision-making, and automated risk assessment. Existing approaches typically rely on a small number of fairness metrics to assess model behaviour across group partitions, implicitly assuming that these metrics provide consistent and reliable conclusions. However, different fairness metrics capture distinct statistical properties of model performance and may therefore produce conflicting assessments when applied to the same system. In this work, we investigate the consistency of fairness evaluation by conducting a systematic multi-metric analysis of demographic bias in machine learning models. Using face recognition as a controlled experimental setting, we evaluate model performance across multiple group partitions under a range of…
