AXIOM: Benchmarking LLM-as-a-Judge for Code via Rule-Based Perturbation and Multisource Quality Calibration

Ruiqi Wang; Xinchen Wang; Cuiyun Gao; Chun Yong Chong; Xin Xia; Qing Liao

arXiv:2512.20159·cs.SE·December 24, 2025

TL;DR

AXIOM is a perturbation-based framework for building balanced, scalable benchmarks that test how reliably LLM judges assess code quality.

Contribution

The paper proposes a perturbation and calibration framework that synthesizes diverse, well-balanced code evaluation benchmarks for LLM judges, addressing limitations of existing benchmarks.

Findings

01. AXIOM generates balanced score distributions across code quality levels.

02. It enables scalable and precise benchmark creation through rule-based perturbations.

03. The framework improves the reliability of LLM evaluation in code assessment.

Abstract

Large language models (LLMs) have been increasingly deployed in real-world software engineering, fostering the development of code evaluation metrics to study the quality of LLM-generated code. Conventional rule-based metrics merely score programs based on their surface-level similarities with reference programs instead of analyzing functionality and code quality in depth. To address this limitation, researchers have developed LLM-as-a-judge metrics, prompting LLMs to evaluate and score code, and curated various code evaluation benchmarks to validate their effectiveness. However, these benchmarks suffer from critical limitations, hindering reliable assessments of evaluation capability: some feature coarse-grained binary labels, which reduce rich code behavior to a single bit of information, obscuring subtle errors. Others propose fine-grained but subjective, vaguely defined evaluation…
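The abstract's point about surface-level similarity can be illustrated with a small sketch. It uses token-set Jaccard overlap as a stand-in for a similarity-based metric; this is an assumption chosen for brevity, not a metric named in the paper, and the function names and the three `max_of` programs are hypothetical examples. A one-token bug keeps the overlap high, while a correct rewrite scores lower.

```python
import io
import tokenize

# Token types that carry layout only, not program content.
_LAYOUT = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT,
}

def code_tokens(source: str) -> set[str]:
    """Lex Python source into a set of token strings, ignoring layout."""
    readline = io.StringIO(source).readline
    return {
        tok.string
        for tok in tokenize.generate_tokens(readline)
        if tok.type not in _LAYOUT
    }

def token_jaccard(reference: str, candidate: str) -> float:
    """Hypothetical surface metric: Jaccard overlap of the two token sets."""
    ref, cand = code_tokens(reference), code_tokens(candidate)
    return len(ref & cand) / len(ref | cand)

def run_max_of(source: str, xs: list[int]) -> int:
    """Execute a snippet and call the max_of function it defines."""
    namespace: dict = {}
    exec(source, namespace)
    return namespace["max_of"](xs)

REFERENCE = """\
def max_of(xs):
    best = xs[0]
    for x in xs:
        if x > best:
            best = x
    return best
"""

# One flipped operator: looks almost identical, but returns the minimum.
BUGGY = REFERENCE.replace("x > best", "x < best")

# Functionally correct, but shares few tokens with the reference.
REWRITE = """\
def max_of(xs):
    return max(xs)
"""

for name, program in [("buggy", BUGGY), ("rewrite", REWRITE)]:
    score = token_jaccard(REFERENCE, program)
    correct = run_max_of(program, [3, 1, 2]) == run_max_of(REFERENCE, [3, 1, 2])
    print(f"{name}: similarity={score:.2f} correct={correct}")
```

On this toy input the buggy program ranks above the correct rewrite, which is the kind of failure that motivates judging functionality rather than surface form.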

Taxonomy

Topics: Software Engineering Research · Topic Modeling · Natural Language Processing Techniques