The Impact of Data Preparation on the Fairness of Software Systems
Inês Valentim, Nuno Lourenço, Nuno Antunes

TL;DR
This study investigates how different data preparation techniques influence the fairness and performance of machine learning models, highlighting that removing sensitive attributes alone is insufficient for fairness, especially in imbalanced datasets.
Contribution
The paper provides a detailed analysis of how individual data preparation steps affect fairness metrics, and shows that the choice of transformation matters when trying to reduce bias.
Findings
Removing sensitive attributes improves fairness but does not eliminate bias.
Data transformations significantly affect fairness in imbalanced datasets.
Random undersampling can sometimes increase prejudice rather than reduce it.
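The findings above refer to two preparation steps that are easy to illustrate: dropping the sensitive attribute and random undersampling. The following is a minimal sketch, not the authors' pipeline. The toy records, the helper names (`drop_attribute`, `random_undersample`, `statistical_parity_difference`), and the choice of statistical parity difference as the fairness measure are all assumptions made for illustration; the summary does not state which metrics the paper uses.

```python
import random
from collections import Counter


def drop_attribute(rows, name):
    """Return copies of the rows without the given attribute (hypothetical helper)."""
    return [{k: v for k, v in row.items() if k != name} for row in rows]


def random_undersample(rows, label_key, seed=0):
    """Randomly discard rows so that every class is as small as the rarest one."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = min(len(group) for group in by_class.values())
    sampled = []
    for group in by_class.values():
        sampled.extend(rng.sample(group, target))
    return sampled


def statistical_parity_difference(rows, group_key, outcome_key, privileged):
    """P(outcome = 1 | unprivileged) - P(outcome = 1 | privileged).

    Assumed metric for this sketch; 0 means both groups receive the
    positive outcome at the same rate.
    """
    def positive_rate(group):
        return sum(r[outcome_key] for r in group) / len(group)

    priv = [r for r in rows if r[group_key] == privileged]
    unpriv = [r for r in rows if r[group_key] != privileged]
    return positive_rate(unpriv) - positive_rate(priv)


# Toy, made-up records: an imbalanced label with a sensitive attribute "sex".
rows = (
    [{"sex": "M", "hours": 40, "label": 1}] * 3
    + [{"sex": "M", "hours": 35, "label": 0}] * 3
    + [{"sex": "F", "hours": 40, "label": 1}] * 1
    + [{"sex": "F", "hours": 35, "label": 0}] * 5
)

# Disparity in the raw labels, before any preparation step.
spd_before = statistical_parity_difference(rows, "sex", "label", privileged="M")

# Random undersampling balances the classes but may shift the disparity
# in either direction, depending on which rows happen to be discarded.
balanced = random_undersample(rows, "label", seed=0)
spd_after = statistical_parity_difference(balanced, "sex", "label", privileged="M")

# Removing the sensitive column hides it from the learner; correlated
# features (here "hours") can still carry the same information.
features_only = drop_attribute(balanced, "sex")
```

Comparing `spd_before` with `spd_after` over several seeds gives a rough feel for why undersampling does not reliably reduce disparity: the outcome depends on which majority-class rows are dropped.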
Abstract
Machine learning models are widely adopted in scenarios that directly affect people. The development of software systems based on these models raises societal and legal concerns, as their decisions may lead to the unfair treatment of individuals based on attributes like race or gender. Data preparation is key in any machine learning pipeline, but its effect on fairness is yet to be studied in detail. In this paper, we evaluate how the fairness and effectiveness of the learned models are affected by the removal of the sensitive attribute, the encoding of the categorical attributes, and instance selection methods (including cross-validators and random undersampling). We used the Adult Income and the German Credit Data datasets, which are widely studied and known to have fairness concerns. We applied each data preparation technique individually to analyse the difference in predictive…
