From If-Statements to ML Pipelines: Revisiting Bias in Code-Generation
Minh Duc Bui, Xenia Heilmann, Mattia Cerrato, Manuel Mager, Katharina von der Wense

TL;DR
This paper shows that simple conditional tests substantially underestimate bias in code generation: when models generate machine learning pipelines, a more realistic task, sensitive attributes appear far more often.
Contribution
It shows that existing bias evaluations, which rely on simple conditional statements, are inadequate: analyzing ML pipeline generation reveals bias that is considerably more prevalent than those tests suggest.
Findings
Generated ML pipelines include sensitive attributes in 87.7% of cases on average.
Bias is more prevalent in pipelines than in simple conditional statements, where sensitive attributes appear in only 59.2% of cases.
Results are consistent across mitigation strategies and pipeline complexities.
Abstract
Prior work evaluates code generation bias primarily through simple conditional statements, which represent only a narrow slice of real-world programming and reveal solely overt, explicitly encoded bias. We demonstrate that this approach dramatically underestimates bias in practice by examining a more realistic task: generating machine learning (ML) pipelines. Testing both code-specialized and general-instruction large language models, we find that generated pipelines exhibit significant bias during feature selection. Sensitive attributes appear in 87.7% of cases on average, despite models demonstrably excluding irrelevant features (e.g., including "race" while dropping "favorite color" for credit scoring). This bias is substantially more prevalent than that captured by conditional statements, where sensitive attributes appear in only 59.2% of cases. These findings are robust across…
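To make the feature-selection measurement concrete, the following is a minimal sketch of how an inclusion rate like the ones quoted above could be computed over a set of generated pipelines. It is not the authors' evaluation code: the `SENSITIVE` set, the function names, and the string-literal matching heuristic are all assumptions made for illustration. The heuristic counts a pipeline as including a sensitive attribute whenever the attribute name appears as a string literal, so it would also flag code that explicitly drops that column; a real evaluation would need to distinguish those cases.

```python
import ast

# Hypothetical list of attribute names treated as sensitive (not from the paper).
SENSITIVE = {"race", "gender", "age"}


def referenced_strings(code):
    """Return every string literal in the generated code, lowercased."""
    tree = ast.parse(code)  # parses only; the generated code is never executed
    return {
        node.value.lower()
        for node in ast.walk(tree)
        if isinstance(node, ast.Constant) and isinstance(node.value, str)
    }


def uses_sensitive(code, sensitive=SENSITIVE):
    """True if any sensitive attribute name appears as a string literal."""
    return bool(referenced_strings(code) & set(sensitive))


def inclusion_rate(snippets, sensitive=SENSITIVE):
    """Fraction of snippets that mention at least one sensitive attribute."""
    flags = [uses_sensitive(s, sensitive) for s in snippets]
    return sum(flags) / len(flags) if flags else 0.0


# Two toy "generated pipelines" for a credit-scoring task (illustrative only).
snippets = [
    'features = ["income", "race", "debt"]\nX = df[features]',
    'features = ["income", "debt"]\nX = df[features]',
]
rate = inclusion_rate(snippets)
```

Here `rate` is the share of toy pipelines that mention a sensitive attribute. Parsing with `ast` rather than searching raw text avoids matching attribute names that only occur in comments or inside longer identifiers.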
