Bias Detection via Maximum Subgroup Discrepancy

Jiří Němeček; Mark Kozdoba; Illia Kryvoviaz; Tomáš Pevný; Jakub Mareček

arXiv:2502.02221 · cs.LG · June 12, 2025

Open Access

TL;DR

This paper introduces the Maximum Subgroup Discrepancy (MSD), a new bias detection metric for AI data and outputs that is computationally feasible, interpretable, and effective in identifying biases across feature subgroups.

Contribution

The paper proposes MSD, a novel subgroup-based distance between distributions whose sample complexity is linear in the number of features, together with a practical evaluation algorithm based on mixed-integer optimization (MIO), to improve bias detection in AI systems.

Findings

01

MSD effectively detects biases in real-world datasets.

02

MSD has sample complexity linear in the number of features.

03

MSD aligns well with a natural bias detection framework.

Abstract

Bias evaluation is fundamental to trustworthy AI, both in terms of checking data quality and in terms of checking the outputs of AI systems. In testing data quality, for example, one may study the distance of a given dataset, viewed as a distribution, to a given ground-truth reference dataset. However, classical metrics, such as the Total Variation and the Wasserstein distances, are known to have high sample complexities and, therefore, may fail to provide a meaningful distinction in many practical scenarios. In this paper, we propose a new notion of distance, the Maximum Subgroup Discrepancy (MSD). In this metric, two distributions are close if, roughly, discrepancies are low for all feature subgroups. While the number of subgroups may be exponential, we show that the sample complexity is linear in the number of features, thus making it feasible for practical applications. Moreover,…
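The idea of "discrepancies are low for all feature subgroups" can be illustrated with a small sketch. It rests on assumptions that the abstract does not spell out: a subgroup is taken to be a conjunction of `feature == value` conditions over a chosen set of features, and the discrepancy of a subgroup is the absolute difference between its empirical share in the two samples. The function name `msd_bruteforce` and the toy data are hypothetical. The sketch enumerates every subgroup exhaustively, which is only feasible for a handful of features and is not the MIO-based algorithm the paper proposes.

```python
from itertools import combinations, product


def share(rows, conditions):
    """Fraction of rows that satisfy every feature == value condition."""
    hits = sum(all(r[f] == v for f, v in conditions.items()) for r in rows)
    return hits / len(rows)


def msd_bruteforce(sample_p, sample_q, features):
    """Largest subgroup discrepancy between two samples (assumed definition).

    sample_p, sample_q: lists of dicts mapping feature name -> value.
    features: feature names used to define subgroups.
    Returns (largest gap, conditions of the first subgroup attaining it).
    """
    # Observed values of each feature across both samples.
    domains = {f: sorted({r[f] for r in sample_p + sample_q}) for f in features}
    best_gap, best_subgroup = 0.0, {}
    # Enumerate conjunctions over every non-empty subset of features.
    for k in range(1, len(features) + 1):
        for feats in combinations(features, k):
            for values in product(*(domains[f] for f in feats)):
                conditions = dict(zip(feats, values))
                gap = abs(share(sample_p, conditions) - share(sample_q, conditions))
                if gap > best_gap:
                    best_gap, best_subgroup = gap, conditions
    return best_gap, best_subgroup


# Toy data (hypothetical): both samples have identical marginals for
# "sex" and "age", but differ on the intersections.
sample_p = [
    {"sex": "F", "age": "young"},
    {"sex": "F", "age": "old"},
    {"sex": "M", "age": "young"},
    {"sex": "M", "age": "old"},
]
sample_q = [
    {"sex": "F", "age": "young"},
    {"sex": "F", "age": "young"},
    {"sex": "M", "age": "old"},
    {"sex": "M", "age": "old"},
]

gap, subgroup = msd_bruteforce(sample_p, sample_q, ["sex", "age"])
print(gap, subgroup)  # 0.25 {'sex': 'F', 'age': 'old'}
```

In this toy example, checking each feature on its own finds no discrepancy, while the intersectional subgroups reveal a gap of 0.25. The number of candidate subgroups grows exponentially with the number of features, which is why exhaustive enumeration does not scale and a dedicated optimization method is needed in practice.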

Taxonomy

Topics: Advanced Statistical Process Monitoring · Statistical Methods and Inference