Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering
Zixiao Zhao, Amirreza Esmaeili, Fatemeh Fard

TL;DR
This paper investigates the reliability and bias of large language models used as judges for code evaluation, revealing significant prompt-induced biases that affect assessment validity and reproducibility.
Contribution
It systematically analyzes prompt biases in LLM-based code judging, shows how they affect decision consistency, and argues for bias-aware evaluation practices.
Findings
Judge decisions are highly sensitive to prompt biases.
Biases can systematically favor certain options, affecting accuracy.
Prompt artifacts can distort model rankings and conclusions.
Abstract
Large Language Models are increasingly used as judges to evaluate code artifacts when exhaustive human review or executable test coverage is unavailable. LLM-as-a-Judge is also relevant in agentic software engineering workflows, where it can help rank candidate solutions and guide patch selection. While attractive for scale, current practice lacks a principled account of reliability and bias: repeated evaluations of the same case can disagree; small prompt edits can swing outcomes; and seemingly semantics-preserving, human-equivalent perturbations may elicit divergent verdicts. This paper studies LLM-as-a-Judge for code through a measurement-first lens. We analyze two pointwise judging regimes across code generation, code repair, and test generation, and we systematically probe prompt-induced biases. Our study considers difficulty levels for repeated runs and controlled prompt…
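The two instabilities the abstract names, disagreement across repeated runs and verdict changes under prompt edits, can each be summarized as a simple rate. The sketch below is illustrative only and is not taken from the paper: the function names `self_agreement` and `flip_rate`, the `"pass"`/`"fail"` verdict labels, and the sample data are all assumptions, and the paper's own metrics may differ.

```python
from collections import Counter
from typing import Sequence


def self_agreement(verdicts: Sequence[str]) -> float:
    """Share of repeated verdicts for ONE case that match the most common verdict.

    Hypothetical helper: 1.0 means every repeated run agreed.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    majority_count = Counter(verdicts).most_common(1)[0][1]
    return majority_count / len(verdicts)


def flip_rate(base: Sequence[str], perturbed: Sequence[str]) -> float:
    """Share of cases whose verdict differs between two prompt variants.

    Hypothetical helper: base[i] and perturbed[i] are verdicts for the same case.
    """
    if len(base) != len(perturbed):
        raise ValueError("verdict lists must cover the same cases")
    if not base:
        raise ValueError("need at least one case")
    flips = sum(b != p for b, p in zip(base, perturbed))
    return flips / len(base)


# Made-up verdicts, for illustration only.
repeated_runs = ["pass", "pass", "fail", "pass"]      # one case, judged four times
base_prompt = ["pass", "fail", "pass", "fail"]        # four cases, original prompt
perturbed_prompt = ["pass", "pass", "pass", "pass"]   # same cases, edited prompt

print(self_agreement(repeated_runs))            # 0.75
print(flip_rate(base_prompt, perturbed_prompt))  # 0.5
```

A flip rate well above zero for a perturbation that humans would treat as equivalent is the kind of signal the abstract describes as a prompt-induced bias.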
