An Exploration of How Training Set Composition Bias in Machine Learning Affects Identifying Rare Objects
Sean E. Lake, Chao-Wei Tsai

TL;DR
This paper investigates how training set composition biases, such as up-weighting rare classes or balancing data, can skew machine learning classifiers towards over-predicting rare objects, and proposes methods to detect and mitigate this bias.
Contribution
The paper introduces statistical techniques for detecting and reducing the effect of training set composition bias on a classifier's predictions, which the authors describe as universally applicable.
Findings
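Up-weighting rare-class examples, or training on data restricted to a near-equal class balance, can bias a classifier toward over-assigning sources to the rare class.
The paper explores how to detect when training data bias has had a statistically significant impact on a trained model's predictions.
The impact of the bias-reduction techniques varies with the details of the application; for most cases it should be modest.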
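As a rough illustration of what detecting a statistically significant over-assignment could look like, the sketch below applies McNemar's test to a labeled validation sample. This is a standard stand-in test, not necessarily the method developed in the paper. The function name, the 0/1 label encoding (1 = rare class), and the assumption that the validation sample has the same class balance as the real population are all choices made for this example.

```python
import math


def mcnemar_overprediction_test(y_true, y_pred):
    """Test whether a classifier assigns the rare class (label 1) at a
    different rate than it truly occurs in a labeled validation sample.

    Hypothetical helper for illustration; uses McNemar's chi-squared
    statistic (1 degree of freedom, no continuity correction).
    Returns (statistic, p_value).
    """
    # Discordant pairs: the only cases that change the predicted rare count.
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    discordant = false_pos + false_neg
    if discordant == 0:
        # Predicted and true rare counts agree exactly: no evidence of bias.
        return 0.0, 1.0

    statistic = (false_pos - false_neg) ** 2 / discordant
    # Survival function of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Toy validation sample: 95 common sources, 5 rare ones.
y_true = [0] * 95 + [1] * 5
# A classifier that flags 20 common sources as rare and finds all 5 rare ones.
y_pred = [1] * 20 + [0] * 75 + [1] * 5
statistic, p_value = mcnemar_overprediction_test(y_true, y_pred)
```

A small p-value with more false positives than false negatives indicates that the classifier assigns the rare class more often than it occurs. The test says nothing about the cause, so it is only a first check.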
Abstract
When training a machine learning classifier on data where one of the classes is intrinsically rare, the classifier will often assign too few sources to the rare class. To address this, it is common to up-weight the examples of the rare class to ensure it isn't ignored. It is also a frequent practice to train on restricted data where the balance of source types is closer to equal for the same reason. Here we show that these practices can bias the model toward over-assigning sources to the rare class. We also explore how to detect when training data bias has had a statistically significant impact on the trained model's predictions, and how to reduce the bias's impact. While the magnitude of the impact of the techniques developed here will vary with the details of the application, for most cases it should be modest. They are, however, universally applicable to every time a machine learning…
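To make the over-assignment effect concrete, the sketch below applies the textbook prior-shift correction: a probability from a classifier trained at one rare-class fraction is rescaled, via Bayes' rule, to the fraction expected in the real population. This is a widely used generic correction, shown here as an assumption and not as the technique developed in the paper. The function name and the example prevalences are hypothetical.

```python
def correct_for_prior_shift(p_rare, train_rare_frac, true_rare_frac):
    """Rescale a rare-class probability from the training-set class
    balance to the true class balance (hypothetical helper).

    Assumes only the class priors differ between training and deployment,
    not the distribution of features within each class.
    """
    # Ratio of true prior to training prior, for each class.
    w_rare = true_rare_frac / train_rare_frac
    w_common = (1.0 - true_rare_frac) / (1.0 - train_rare_frac)

    # Reweight each class's posterior and renormalize.
    numerator = p_rare * w_rare
    denominator = numerator + (1.0 - p_rare) * w_common
    return numerator / denominator


# A classifier trained on balanced data (50% rare) outputs 0.9 for a source
# drawn from a population in which the rare class makes up only 1%.
corrected = correct_for_prior_shift(0.9, train_rare_frac=0.5, true_rare_frac=0.01)
```

In this toy case the corrected probability falls well below 0.5, so a source that the balanced-data classifier would confidently assign to the rare class is no longer assigned to it once the true rarity is taken into account.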
