Systematic Bias in Sample Inference and its Effect on Machine Learning
Owen O'Neill, Fintan Costello

TL;DR
This paper argues that systematic bias in machine learning models, especially underprediction for minority groups, arises from statistical inference on small samples, and demonstrates the effect by analysing decision tree models on real datasets.
Contribution
The paper proposes that small-sample inference causes systematic bias in ML predictions, which explains why underprediction is more pronounced for minority groups, and supports this with empirical analysis.
Findings
Bias correlates strongly with small-sample inference in models
Underprediction is more severe for minority groups due to sample size effects
Small-sample inference bias explains observed underprediction patterns
Abstract
A commonly observed pattern in machine learning models is an underprediction of the target feature, with the model's predicted target rate for members of a given category typically being lower than the actual target rate for members of that category in the training set. This underprediction is usually larger for members of minority groups; while income level is underpredicted for both men and women in the 'adult' dataset, for example, the degree of underprediction is significantly higher for women (a minority in that dataset). We propose that this pattern of underprediction for minorities arises as a predictable consequence of statistical inference on small samples. When presented with a new individual for classification, an ML model performs inference not on the entire training set, but on a subset that is in some way similar to the new individual, with sizes of these subsets typically…
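The underprediction described in the abstract can be measured by comparing, for each category, the actual target rate in the data with the rate the model predicts. The following is a minimal sketch of that comparison, not the authors' code: the function name `target_rates_by_group`, the group labels "A" and "B", and the toy labels and predictions are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def target_rates_by_group(groups, y_true, y_pred):
    """Return {group: (actual_rate, predicted_rate, gap)}.

    gap = actual_rate - predicted_rate, so a positive gap means the
    model underpredicts the target for that group.
    """
    # Per group: [number of members, actual positives, predicted positives]
    counts = defaultdict(lambda: [0, 0, 0])
    for g, t, p in zip(groups, y_true, y_pred):
        c = counts[g]
        c[0] += 1
        c[1] += t
        c[2] += p
    return {g: (a / n, p / n, (a - p) / n) for g, (n, a, p) in counts.items()}


# Hypothetical toy data: group "A" is the majority (8 rows), "B" the minority (4 rows).
groups = ["A"] * 8 + ["B"] * 4
y_true = [1, 1, 1, 1, 0, 0, 0, 0] + [1, 1, 0, 0]   # actual binary target
y_pred = [1, 1, 1, 0, 0, 0, 0, 0] + [1, 0, 0, 0]   # made-up model predictions

rates = target_rates_by_group(groups, y_true, y_pred)
for group, (actual, predicted, gap) in sorted(rates.items()):
    print(f"{group}: actual={actual:.3f} predicted={predicted:.3f} gap={gap:.3f}")
```

In this invented example both groups are underpredicted, and the gap is larger for the smaller group, which is the shape of the pattern the abstract describes. The numbers themselves carry no empirical weight.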
Taxonomy
Topics: Data Analysis with R · Forecasting Techniques and Applications · Imbalanced Data Classification Techniques
