Underrepresentation, Label Bias, and Proxies: Towards Data Bias Profiles for the EU AI Act and Beyond

Marina Ceccon; Giandomenico Cornacchia; Davide Dalle Pezze; Alessandro Fabris; Gian Antonio Susto

arXiv:2507.08866 · cs.LG · July 15, 2025

TL;DR

This paper investigates common data biases affecting algorithmic fairness, introduces the Data Bias Profile (DBP) to systematically document these biases, and demonstrates its effectiveness in predicting discriminatory risks and guiding fairness interventions.

Contribution

The paper identifies and studies three key data biases (underrepresentation, label bias, and proxies), develops the DBP as a systematic bias detection tool, and connects fairness research with anti-discrimination policy.

Findings

01 Underrepresentation of vulnerable groups in training sets is less conducive to discrimination than conventionally assumed, while combinations of proxies and label bias can be far more critical.

02 The DBP effectively predicts discriminatory risks across datasets.

03 Bias profiling insights help guide fairness interventions.

Abstract

Undesirable biases encoded in the data are key drivers of algorithmic discrimination. Their importance is widely recognized in the algorithmic fairness literature, as well as legislation and standards on anti-discrimination in AI. Despite this recognition, data biases remain understudied, hindering the development of computational best practices for their detection and mitigation. In this work, we present three common data biases and study their individual and joint effect on algorithmic discrimination across a variety of datasets, models, and fairness measures. We find that underrepresentation of vulnerable populations in training sets is less conducive to discrimination than conventionally affirmed, while combinations of proxies and label bias can be far more critical. Consequently, we develop dedicated mechanisms to detect specific types of bias, and combine them into a preliminary…
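
To make the three biases named in the abstract more concrete, the following is a minimal sketch of one simple indicator per bias on a toy dataset. It is not the paper's DBP or its detection mechanisms, which the text above does not describe. The function names (`group_share`, `positive_rate_gap`, `proxy_strength`), the toy data, and the choice of statistics are all assumptions made for illustration. In particular, a gap in positive-label rates is only a signal that may warrant a closer look at label bias, since it cannot distinguish biased labels from genuine differences without ground truth.

```python
from statistics import mean, pstdev


def group_share(groups, group):
    """Fraction of rows belonging to `group` (an underrepresentation indicator)."""
    return sum(1 for g in groups if g == group) / len(groups)


def positive_rate_gap(labels, groups, group):
    """Positive-label rate of all other rows minus that of `group`.

    A large gap is only a hint of possible label bias, not proof of it.
    """
    inside = [y for y, g in zip(labels, groups) if g == group]
    outside = [y for y, g in zip(labels, groups) if g != group]
    return mean(outside) - mean(inside)


def proxy_strength(feature, groups, group):
    """Absolute Pearson correlation between a feature and membership in `group`."""
    member = [1.0 if g == group else 0.0 for g in groups]
    mean_f, mean_m = mean(feature), mean(member)
    cov = mean((f - mean_f) * (m - mean_m) for f, m in zip(feature, member))
    denom = pstdev(feature) * pstdev(member)
    return abs(cov / denom) if denom else 0.0


# Hypothetical toy data: group "b" is the smaller group.
groups = ["a", "a", "a", "a", "a", "a", "b", "b"]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
# A non-sensitive feature that happens to track group membership closely.
feature = [0.9, 0.8, 0.7, 0.9, 0.8, 0.7, 0.1, 0.2]

profile = {
    "share_of_b": group_share(groups, "b"),
    "positive_rate_gap": positive_rate_gap(labels, groups, "b"),
    "proxy_strength": proxy_strength(feature, groups, "b"),
}
print(profile)
```

Collecting the indicators in one dictionary mirrors the idea of documenting several biases side by side, so that their combinations (for example, a strong proxy together with a large label-rate gap) can be inspected jointly rather than one at a time.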
