Data Representativity for Machine Learning and AI Systems
Line H. Clemmensen, Rune D. Kjærsgaard

TL;DR
This paper examines the importance of data representativity in AI, introduces measurable concepts for evaluating samples, and proposes a framework to improve data documentation and selection for better inference.
Contribution
It introduces three measurable concepts of data representativity, analyzes their implications, and offers a framework for creating and documenting representative datasets in AI.
Findings
Coverage-based and distribution-mimicking representativity have distinct effects
Empirical evaluation on US Census data illustrates the contrasting qualities of the two notions
Proposes a practical framework for data creation and documentation
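The contrast in the first finding can be made concrete with a toy example. The sketch below is illustrative only: the two metrics (share of population categories present in the sample, and total variation distance between category frequencies) are assumptions chosen for simplicity, not the paper's three measurable concepts, and the population and samples are hypothetical.

```python
from collections import Counter


def coverage(sample, population):
    """Fraction of population categories that appear at least once in the sample."""
    categories = set(population)
    return len(set(sample) & categories) / len(categories)


def total_variation(sample, population):
    """Total variation distance between the empirical category frequencies.

    0 means identical proportions; 1 means disjoint support.
    """
    p = Counter(population)
    q = Counter(sample)
    categories = set(p) | set(q)
    return 0.5 * sum(
        abs(p[c] / len(population) - q[c] / len(sample)) for c in categories
    )


# Hypothetical population: 90% "A", 8% "B", 2% "C".
population = ["A"] * 90 + ["B"] * 8 + ["C"] * 2

# Sample that roughly mirrors the population proportions but misses the rare group.
mimic = ["A"] * 9 + ["B"] * 1

# Sample that covers every group about equally, ignoring population proportions.
cover = ["A"] * 4 + ["B"] * 3 + ["C"] * 3

coverage_mimic = coverage(mimic, population)
tv_mimic = total_variation(mimic, population)
coverage_cover = coverage(cover, population)
tv_cover = total_variation(cover, population)

print(f"mimic: coverage={coverage_mimic:.2f}, TV distance={tv_mimic:.2f}")
print(f"cover: coverage={coverage_cover:.2f}, TV distance={tv_cover:.2f}")
```

In this toy setting the proportion-matching sample stays close to the population distribution yet contains no examples of group "C", while the coverage-oriented sample includes every group at the cost of a large distributional mismatch. Which property matters more depends on the inference the model is meant to support.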
Abstract
Data representativity is crucial when drawing inference from data through machine learning models. Scholars have increased focus on unraveling the bias and fairness in models, also in relation to inherent biases in the input data. However, limited work exists on the representativity of samples (datasets) for appropriate inference in AI systems. This paper reviews definitions and notions of a representative sample and surveys their use in scientific AI literature. We introduce three measurable concepts to help focus the notions and evaluate different data samples. Furthermore, we demonstrate that the contrast between a representative sample in the sense of coverage of the input space, versus a representative sample mimicking the distribution of the target population is of particular relevance when building AI systems. Through empirical demonstrations on US Census data, we evaluate the…
Taxonomy
Topics: Bayesian Modeling and Causal Inference · Data Stream Mining Techniques · Data Analysis with R
