LLMs Are Not a Silver Bullet: A Case Study on Software Fairness
Xinyue Li, Sixuan Li, Ying Xiao, Jie M. Zhang, Zhou Yang, Xuanzhe Liu, Zhenpeng Chen

TL;DR
This study compares traditional ML and LLM-based methods for bias mitigation on tabular data. It finds that ML methods consistently outperform LLM-based methods, which often rely on limited in-context learning and on artificial evaluation settings.
Contribution
It provides a large-scale comparison showing LLMs do not surpass traditional ML methods for fairness, highlighting limitations of current LLM-based bias mitigation approaches.
Findings
ML methods outperform LLMs in both fairness and predictive performance
Prior LLM studies' gains are due to artificial test data
Supervised fine-tuning of LLMs offers limited advantages
Abstract
Fairness is a critical requirement for human-related, high-stakes software systems, motivating extensive research on bias mitigation. Prior work has largely focused on tabular data settings using traditional Machine Learning (ML) methods. With the rapid rise of Large Language Models (LLMs), recent studies have begun to explore their use for bias mitigation in the same setting. However, it remains unclear whether LLM-based methods offer advantages over traditional ML methods, leaving software engineers without clear guidance for practical adoption. To address this gap, we present a large-scale study comparing state-of-the-art ML- and LLM-based bias mitigation methods. We find that ML-based methods consistently outperform LLM-based methods in both fairness and predictive performance, with even strong LLMs failing to surpass established ML baselines. To understand why prior LLM-based…
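As a rough illustration of what comparing methods on "fairness and predictive performance" can look like for tabular binary classification, here is a minimal sketch. The metric choice (accuracy and statistical parity difference), the function names, and the toy data are all assumptions made for illustration; the abstract does not state which metrics, datasets, or protected attributes the study uses.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the ground-truth labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def statistical_parity_difference(
    y_pred: Sequence[int], protected: Sequence[int]
) -> float:
    """P(pred = 1 | unprivileged) - P(pred = 1 | privileged).

    Assumes binary predictions and a binary protected attribute
    (1 = privileged group, 0 = unprivileged group). A value of 0
    means both groups receive favorable predictions at the same rate.
    """
    if len(y_pred) != len(protected):
        raise ValueError("inputs must be of equal length")
    priv = [p for p, a in zip(y_pred, protected) if a == 1]
    unpriv = [p for p, a in zip(y_pred, protected) if a == 0]
    if not priv or not unpriv:
        raise ValueError("both groups must contain at least one sample")
    return sum(unpriv) / len(unpriv) - sum(priv) / len(priv)


# Hypothetical toy data, not taken from the paper.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0]
protected = [1, 1, 1, 1, 0, 0, 0, 0]

print(accuracy(y_true, y_pred))                          # 0.5
print(statistical_parity_difference(y_pred, protected))  # -0.5
```

Reporting both numbers side by side for each method makes a fairness-performance trade-off visible: a method that pushes the parity difference toward 0 while sharply reducing accuracy is not a clear improvement over a baseline.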
