Automated Validation of LLM-based Evaluators for Software Engineering Artifacts
Ora Nova Fandina, Eitan Farchi, Shmulik Froimovich, Rami Katan, Alice Podolsky, Orna Raz, Avi Ziv

TL;DR
This paper presents REFINE, a novel automated framework for benchmarking and tuning LLM-based evaluators in software engineering, improving their ability to discern fine-grained quality differences in code artifacts.
Contribution
Introduction of REFINE, a controllable, automated benchmarking framework that synthesizes artifacts with varying quality to evaluate and enhance LLM evaluators in real-world software tasks.
Findings
REFINE improved evaluator alignment scores from below 0.7 to above 0.9.
The framework successfully identified nuanced evaluator configurations for enterprise code tasks.
Evaluators are now actively used in IBM's model release decisions.
Abstract
Automation in software engineering increasingly relies on large language models (LLMs) to generate, review, and assess code artifacts. However, establishing LLMs as reliable evaluators remains an open challenge: human evaluations are costly, subjective, and non-scalable, while existing automated methods fail to discern fine-grained variations in artifact quality. We introduce REFINE (Ranking Evaluators for FIne-grained Nuanced Evaluation), an automated framework for benchmarking LLM-based evaluators across software engineering tasks. REFINE comprises two modules: the Hierarchy Dataset Builder applies novel generation techniques to automatically synthesize artifacts with progressively reduced quality, and the Evaluator Tester quantifies each candidate evaluator configuration by measuring how closely its rankings align with the expected ordering. A key feature of REFINE is controllability:…
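The abstract describes the Evaluator Tester as measuring how closely an evaluator's rankings align with an expected ordering, but the excerpt here does not say which alignment metric REFINE uses. As a rough illustration only, the sketch below assumes a simple pairwise-concordance score: the fraction of artifact pairs that the evaluator orders the same way as the expected quality hierarchy. The function name `alignment_score`, the input layout, and the example scores are hypothetical and not taken from the paper.

```python
from itertools import combinations


def alignment_score(expected_rank, evaluator_scores):
    """Fraction of artifact pairs ordered consistently with the expected hierarchy.

    expected_rank: list of ints, lower means higher expected quality
                   (for example, 0 for the original artifact, larger values
                   for progressively degraded variants).
    evaluator_scores: list of floats from the candidate evaluator,
                      higher means the evaluator judged the artifact better.

    NOTE: pairwise concordance is an assumption made for illustration;
    the metric actually used by REFINE is not stated in this excerpt.
    """
    if len(expected_rank) != len(evaluator_scores):
        raise ValueError("expected_rank and evaluator_scores must have equal length")

    concordant = 0
    total = 0
    for i, j in combinations(range(len(expected_rank)), 2):
        if expected_rank[i] == expected_rank[j]:
            continue  # no expected ordering between artifacts at the same level
        total += 1
        better, worse = (i, j) if expected_rank[i] < expected_rank[j] else (j, i)
        # Ties in evaluator scores count as misses: the evaluator failed
        # to separate two artifacts of different expected quality.
        if evaluator_scores[better] > evaluator_scores[worse]:
            concordant += 1

    if total == 0:
        raise ValueError("need at least two artifacts with different expected ranks")
    return concordant / total


# Hypothetical hierarchy of four artifacts, from original (0) to most degraded (3).
expected = [0, 1, 2, 3]
config_a = [9.0, 7.5, 6.0, 2.0]  # orders every pair as expected
config_b = [9.0, 6.0, 7.5, 2.0]  # swaps the two middle artifacts

score_a = alignment_score(expected, config_a)
score_b = alignment_score(expected, config_b)
print(score_a, score_b)
```

Under this assumed metric, competing evaluator configurations (model, prompt, scoring scale) could be compared by their scores on the same synthesized hierarchy, with higher values indicating closer agreement with the expected ordering.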
