Don't Judge Code by Its Cover: Exploring Biases in LLM Judges for Code Evaluation
Jiwon Moon, Yerin Hwang, Dongryeol Lee, Taegwan Kang, Yongil Kim, Kyomin Jung

TL;DR
This paper investigates biases in large language model (LLM) judges for code evaluation. It shows that, across multiple programming languages, their verdicts are swayed by superficial variations in code, which undermines the fairness and reliability of their scores.
Contribution
It provides the first comprehensive analysis of biases in LLM-based code evaluation, defining six bias types and demonstrating their systematic effects across five programming languages and multiple LLMs.
Findings
LLM judges are biased by superficial code variations.
Biases lead to inflated or unfairly low scores.
Prompting for test cases does not eliminate biases.
Abstract
With the growing use of large language models (LLMs) as evaluators, their application has expanded to code evaluation tasks, where they assess the correctness of generated code without relying on reference implementations. While this offers scalability and flexibility, it also raises a critical, unresolved question: Can LLM judges fairly and robustly evaluate semantically equivalent code with superficial variations? Functionally correct code often exhibits variations, such as differences in variable names, comments, or formatting, that should not influence its correctness. Yet, whether LLM judges can reliably handle these variations remains unclear. We present the first comprehensive study of this issue, defining six types of potential bias in code evaluation and revealing their systematic impact on LLM judges. Across five programming languages and multiple LLMs, we empirically demonstrate…
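To make "semantically equivalent code with superficial variations" concrete, here is a minimal Python sketch, not the authors' method. It renames variables and prepends a comment, two of the variation kinds the abstract names, and then checks that the original and the variant behave identically. The helper names (`RenameVariables`, `make_variant`, `run`), the sample function, and the rename mapping are all hypothetical, and `ast.unparse` assumes Python 3.9 or later.

```python
import ast

# Hypothetical sample solution; any small, correct function would do.
ORIGINAL = '''
def total(values):
    result = 0
    for v in values:
        result += v
    return result
'''


class RenameVariables(ast.NodeTransformer):
    """Rename local identifiers according to a mapping (illustrative helper)."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Covers both reads and writes of a variable.
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        # Covers function parameters.
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def make_variant(source, mapping, comment):
    """Return source with renamed variables and a leading comment."""
    tree = RenameVariables(mapping).visit(ast.parse(source))
    return f"# {comment}\n{ast.unparse(tree)}"


def run(source, func_name, argument):
    """Execute source in a fresh namespace and call the named function."""
    namespace = {}
    exec(source, namespace)
    return namespace[func_name](argument)


VARIANT = make_variant(
    ORIGINAL,
    {"values": "a", "result": "b", "v": "c"},
    "sum the items",
)

# The two versions differ on the surface but agree on behavior.
SAME_BEHAVIOR = all(
    run(ORIGINAL, "total", case) == run(VARIANT, "total", case)
    for case in ([], [1, 2, 3], [-5, 5, 10])
)
```

Because the two versions agree on every input, a robust judge would be expected to give them the same correctness verdict. Querying an actual LLM judge and comparing its scores for `ORIGINAL` and `VARIANT` is outside the scope of this sketch.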
Taxonomy
Topics: Topic Modeling · Authorship Attribution and Profiling · Computational and Text Analysis Methods
