The Costs of Pretending That There Are Data-Generating Probability Distributions in the Social World
Benedikt Höltgen, Robert C. Williamson

TL;DR
This paper critiques the common assumption that true data-generating probability distributions exist in machine learning applied to social settings, arguing that the assumption is misleading and proposing alternative frameworks that focus on populations.
Contribution
It challenges the existence of true probability distributions in social contexts and advocates for models centered on relevant populations instead.
Findings
The authors argue that true data-generating probability distributions do not exist in social settings.
Frameworks that focus directly on relevant populations can replace distributional assumptions while leaving classical learning theory almost unchanged.
Assuming true probabilities or data-generating distributions can mislead and can obscure both the choices made and the goals pursued in machine learning practice.
Abstract
Machine Learning research, including work promoting fair or equitable algorithms, often relies on the concept of a data-generating probability distribution. The standard presumption is that since data points are 'sampled from' such a distribution, one can learn from observed data about this distribution and, thus, predict future data points which are also drawn from it. We argue, however, that such true probability distributions do not exist and that the rhetoric around them is harmful in social settings. We show that alternative frameworks focusing directly on relevant populations rather than abstract distributions are available and leave classical learning theory almost unchanged. Furthermore, we argue that the assumption of true probabilities or data-generating distributions can be misleading and obscure both the choices made and the goals pursued in machine learning practice. Based…
