Shedding light on underrepresentation and Sampling Bias in machine learning
Sami Zhioua, Rūta Binkytė

TL;DR
This paper clarifies different types of sampling bias in machine learning, analyzes their impact on fairness, and questions the effectiveness of simply collecting more data from underrepresented groups to reduce discrimination.
Contribution
It introduces clear definitions for sample size bias and underrepresentation bias, and analyzes how these biases affect fairness and model discrimination.
Findings
Discrimination can be decomposed into variance, bias, and noise.
Sampling bias affects fairness differently across groups.
Collecting more data from underrepresented groups may not always mitigate discrimination.
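As a rough illustration of why group sample size matters when measuring disparity, the sketch below repeatedly draws synthetic binary outcomes for two groups and reports the mean and spread of the estimated difference in positive-outcome rates. It is not the paper's experimental setup: the function names, the outcome probabilities (0.6 and 0.5), and the group sizes are all assumptions chosen for illustration, and the measure is a simple rate difference on raw outcomes rather than on a trained model's predictions.

```python
import random
import statistics


def statistical_disparity(outcomes_a, outcomes_b):
    """Difference in positive-outcome rates between two groups."""
    rate_a = sum(outcomes_a) / len(outcomes_a)
    rate_b = sum(outcomes_b) / len(outcomes_b)
    return rate_a - rate_b


def simulate_disparity_estimates(n_a, n_b, p_a=0.6, p_b=0.5, trials=1000, seed=0):
    """Return (mean, standard deviation) of the disparity estimate.

    n_a, n_b: sample sizes of the two groups (hypothetical values).
    p_a, p_b: assumed true positive-outcome probabilities per group.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        a = [rng.random() < p_a for _ in range(n_a)]
        b = [rng.random() < p_b for _ in range(n_b)]
        estimates.append(statistical_disparity(a, b))
    return statistics.mean(estimates), statistics.stdev(estimates)


if __name__ == "__main__":
    settings = {
        "large, balanced (500 vs 500)": (500, 500),
        "small, balanced (20 vs 20)": (20, 20),
        "imbalanced (980 vs 20)": (980, 20),
    }
    for label, (n_a, n_b) in settings.items():
        mean, spread = simulate_disparity_estimates(n_a, n_b)
        print(f"{label}: mean={mean:.3f}, std={spread:.3f}")
```

Under these assumptions the estimate stays centered near the true difference of 0.1 in every setting, but its spread grows when either group is small. This only illustrates the variance side of the picture; it says nothing about whether adding data from an underrepresented group reduces a trained model's discrimination.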
Abstract
Accurately measuring discrimination is crucial to faithfully assessing the fairness of trained machine learning (ML) models. Any bias in measuring discrimination leads to either amplification or underestimation of the existing disparity. Several sources of bias exist, and it is assumed that bias resulting from machine learning is borne equally by different groups (e.g. females vs males, whites vs blacks, etc.). If, however, bias is borne differently by different groups, it may exacerbate discrimination against specific sub-populations. The term sampling bias is used inconsistently in the literature to describe bias due to the sampling procedure. In this paper, we attempt to disambiguate this term by introducing clearly defined variants of sampling bias, namely, sample size bias (SSB) and underrepresentation bias (URB). We also show how discrimination can be decomposed into variance, bias, and noise.…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI
