On the Impact of Random Seeds on the Fairness of Clinical Classifiers
Silvio Amir, Jan-Willem van de Meent, Byron C. Wallace

TL;DR
This study investigates how random seed choice affects the fairness of clinical classifiers trained on EHR data. It finds that subgroup performance fluctuates substantially across seeds and argues that fairness assessments need to account for such stochastic effects.
Contribution
The paper demonstrates that random seed choice affects fairness metrics in clinical prediction models, and shows that the small sample sizes available for minority subgroups limit how accurately disparities can be estimated.
Findings
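Apparent subgroup performance varies substantially across random seeds that yield similar overall performance.
No evidence of a trade-off between overall performance and subgroup performance.
Jointly optimizing for high overall performance and low disparities yields no statistically significant improvements.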
Abstract
Recent work has shown that fine-tuning large networks is surprisingly sensitive to changes in random seed(s). We explore the implications of this phenomenon for model fairness across demographic groups in clinical prediction tasks over electronic health records (EHR) in MIMIC-III -- the standard dataset in clinical NLP research. Apparent subgroup performance varies substantially for seeds that yield similar overall performance, although there is no evidence of a trade-off between overall and subgroup performance. However, we also find that the small sample sizes inherent to looking at intersections of minority groups and somewhat rare conditions limit our ability to accurately estimate disparities. Further, we find that jointly optimizing for high overall performance and low disparities does not yield statistically significant improvements. Our results suggest that fairness work using…
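The sample-size caveat in the abstract can be illustrated with a small sketch. The code below is not the authors' method: it is a generic percentile bootstrap over a synthetic test set, and the names `accuracy` and `bootstrap_gap_ci`, the group size of 40, and the 85% accuracy figure are all assumptions chosen for illustration. It shows how wide the uncertainty on an accuracy gap can be when the subgroup is small, even when the model is constructed to be equally accurate on everyone.

```python
import random


def accuracy(correct):
    """Fraction of correct predictions in a list of booleans."""
    return sum(correct) / len(correct)


def bootstrap_gap_ci(correct, in_group, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for (overall accuracy - subgroup accuracy).

    Hypothetical helper for illustration; resamples test examples with
    replacement and recomputes the gap on each resample.
    """
    rng = random.Random(seed)
    n = len(correct)
    gaps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        group = [correct[i] for i in idx if in_group[i]]
        if not group:  # resample contained no subgroup members; skip it
            continue
        overall = [correct[i] for i in idx]
        gaps.append(accuracy(overall) - accuracy(group))
    gaps.sort()
    lo = gaps[int((alpha / 2) * len(gaps))]
    hi = gaps[min(len(gaps) - 1, int((1 - alpha / 2) * len(gaps)))]
    return lo, hi


# Synthetic test set (an assumption, not MIMIC-III data): 2000 examples,
# 40 of them in the subgroup, with exactly 85% accuracy in both parts,
# so the true observed gap is zero.
correct = [True] * 34 + [False] * 6 + [True] * 1666 + [False] * 294
in_group = [i < 40 for i in range(len(correct))]

lo, hi = bootstrap_gap_ci(correct, in_group)
print(f"95% CI for accuracy gap: [{lo:.3f}, {hi:.3f}]")
```

With only 40 subgroup examples the interval spans many percentage points on either side of zero, so a gap of that size observed for a single seed would be hard to distinguish from sampling noise.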
