ProPublica's COMPAS Data Revisited
Matias Barenstein

TL;DR
This paper identifies a data processing error in ProPublica's COMPAS dataset that inflates recidivism rates and affects some fairness metrics, highlighting the importance of accurate data handling in algorithmic fairness research.
Contribution
It reveals a critical dataset construction flaw in ProPublica's COMPAS data and demonstrates its impact on recidivism statistics and fairness evaluations.
Findings
The two-year dataset keeps over 40% more recidivists than a correctly applied cutoff would
The two-year recidivism rate is inflated by over 24% as a result
Some statistical measures are unaffected by the error
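The relationship between these findings can be sketched with a little arithmetic. The block below is a minimal illustration, not the paper's computation. All counts are hypothetical. It assumes the inflation figure is a relative change in the recidivism rate, and that the extra recidivists are spread across predicted-high and predicted-low groups in the same proportion as the correctly included ones. Under those assumptions, extra recidivists raise the rate less than proportionally, because they enter both the numerator and the denominator. Measures computed within each outcome group (false positive and false negative rates) stay put, while predictive values shift.

```python
def recidivism_rate(n_recid: float, n_nonrecid: float) -> float:
    """Share of defendants in the sample who recidivated."""
    return n_recid / (n_recid + n_nonrecid)


def relative_inflation(n_recid: float, n_nonrecid: float, extra_share: float) -> float:
    """Relative increase in the recidivism rate when the number of
    recidivists is scaled up by (1 + extra_share) and non-recidivists
    are left unchanged."""
    correct = recidivism_rate(n_recid, n_nonrecid)
    biased = recidivism_rate(n_recid * (1 + extra_share), n_nonrecid)
    return biased / correct - 1


def rates(tp: float, fn: float, fp: float, tn: float) -> dict:
    """Confusion-matrix measures, treating 'recidivated' as the positive class."""
    return {
        "fpr": fp / (fp + tn),  # computed among non-recidivists only
        "fnr": fn / (fn + tp),  # computed among recidivists only
        "ppv": tp / (tp + fp),  # depends on the share of recidivists
        "npv": tn / (tn + fn),  # depends on the share of recidivists
    }


def add_recidivists(tp: float, fn: float, fp: float, tn: float, extra_share: float):
    """Scale both recidivist cells by the same factor (an assumption:
    the extra recidivists have the same score distribution)."""
    k = 1 + extra_share
    return tp * k, fn * k, fp, tn


# Hypothetical sample: 360 recidivists, 640 non-recidivists.
inflation = relative_inflation(360, 640, extra_share=0.4)

# Hypothetical confusion counts consistent with that sample.
base = rates(tp=200, fn=160, fp=150, tn=490)
biased = rates(*add_recidivists(200, 160, 150, 490, extra_share=0.4))
```

Comparing `base` and `biased` shows which measures move under these assumptions. With real data the extra recidivists need not share the same score distribution, so the invariance would hold only approximately.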
Abstract
I examine the COMPAS recidivism risk score and criminal history data collected by ProPublica in 2016 that fueled intense debate and research in the nascent field of 'algorithmic fairness'. ProPublica's COMPAS data is used in an increasing number of studies to test various definitions of algorithmic fairness. This paper takes a closer look at the actual datasets put together by ProPublica, in particular the sub-datasets built to study the likelihood of recidivism within two years of a defendant's original COMPAS survey screening date. I take a new yet simple approach to visualizing these data by analyzing the distribution of defendants across COMPAS screening dates. I find that ProPublica made an important data processing error when it created these datasets, failing to implement a two-year sample cutoff rule for recidivists in such datasets (whereas it implemented a two-year sample…
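A two-year sample cutoff of the kind the abstract describes can be sketched as follows. This is a minimal illustration under stated assumptions, not ProPublica's or the author's code. The `Defendant` record, its field names, the 730-day window, and the dates are all hypothetical. The idea is that a defendant can only be observed for a full two years if screening took place at least two years before data collection ended, so the same date filter is applied to everyone regardless of outcome.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumption: "two years" is approximated as 730 days.
TWO_YEARS = timedelta(days=730)


@dataclass
class Defendant:
    # Hypothetical record layout, for illustration only.
    screening_date: date
    recidivated_within_two_years: bool


def apply_two_year_cutoff(defendants: list, data_end: date) -> list:
    """Keep only defendants screened at least two years before the end of
    data collection, whether or not they recidivated."""
    cutoff = data_end - TWO_YEARS
    return [d for d in defendants if d.screening_date <= cutoff]


def screening_counts_by_month(defendants: list) -> dict:
    """Count defendants per (year, month) of screening. Plotting these
    counts separately for recidivists and non-recidivists is one way to
    see whether a cutoff was applied to both groups."""
    counts: dict = {}
    for d in defendants:
        key = (d.screening_date.year, d.screening_date.month)
        counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical sample and collection end date.
sample = [
    Defendant(date(2013, 6, 1), False),
    Defendant(date(2013, 6, 15), True),
    Defendant(date(2015, 6, 1), True),   # screened too late for a full window
    Defendant(date(2015, 7, 1), False),  # screened too late for a full window
]
kept = apply_two_year_cutoff(sample, data_end=date(2016, 1, 1))
monthly = screening_counts_by_month(kept)
```

Applying the filter to one outcome group but not the other leaves late-screened defendants of only one type in the sample, which is the kind of imbalance a distribution over screening dates can reveal.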
Taxonomy
Topics: Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education
