Vintage Code, Modern Judges: Meta-Validation in Low Data Regimes
Ora Nova Fandina, Gal Amram, Eitan Farchi, Shmulik Froimovich, Raviv Gal, Wesam Ibraheem, Rami Katan, Alice Podolsky, and Orna Raz

TL;DR
This paper introduces SparseAlign, a framework for validating Large Language Model judges against limited human evaluation data, ensuring reliable model assessment in resource-scarce domains like legacy code modernization.
Contribution
SparseAlign provides a novel method that combines pairwise-confidence and score-sensitive metrics to evaluate how well a Large Language Model as a Judge (LaaJ) aligns with sparse human labels, improving validation reliability.
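To make the idea of comparing a judge against a handful of human labels concrete, here is a minimal sketch. It is not the SparseAlign metric, whose exact definitions are not given in this summary. It assumes two generic stand-ins: a pairwise ordering agreement (how often the judge ranks two items the same way the human does) and a mean absolute score gap (how far the judge's scores sit from the human's). The function names and the example scores are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(human, judge):
    """Fraction of item pairs that the judge orders the same way as the human.

    Pairs the human scored equally are skipped. A pair counts as agreeing
    only when the judge's difference has the same sign as the human's,
    so a judge tie on a pair the human ranked counts as a disagreement.
    Generic illustration only, not the paper's pairwise-confidence metric.
    """
    concordant = 0
    comparable = 0
    for i, j in combinations(range(len(human)), 2):
        h_diff = human[i] - human[j]
        if h_diff == 0:
            continue  # human expresses no preference for this pair
        comparable += 1
        j_diff = judge[i] - judge[j]
        if h_diff * j_diff > 0:
            concordant += 1
    return concordant / comparable if comparable else float("nan")


def mean_abs_gap(human, judge):
    """Average absolute difference between judge and human scores.

    A simple stand-in for a score-sensitive measure (assumption).
    """
    return sum(abs(h - g) for h, g in zip(human, judge)) / len(human)


# Hypothetical scores on a 1-5 scale for five explanations.
human_scores = [5, 3, 4, 1, 2]
judge_scores = [4, 3, 5, 1, 1]

agreement = pairwise_agreement(human_scores, judge_scores)
gap = mean_abs_gap(human_scores, judge_scores)
print(f"pairwise agreement: {agreement:.2f}, mean absolute gap: {gap:.2f}")
```

With only a few labeled items, both numbers are noisy estimates, which is the low-data difficulty the paper targets.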
Findings
SparseAlign effectively identifies well-aligned LaaJs with minimal human data.
Application to COBOL code explanation demonstrates practical utility.
Using validated LaaJs to guide model release decisions improves assessment accuracy.
Abstract
Application modernization in legacy languages such as COBOL, PL/I, and REXX faces an acute shortage of resources, both in expert availability and in high-quality human evaluation data. While Large Language Models as a Judge (LaaJ) offer a scalable alternative to expert review, their reliability must be validated before being trusted in high-stakes workflows. Without principled validation, organizations risk a circular evaluation loop, where unverified LaaJs are used to assess model outputs, potentially reinforcing unreliable judgments and compromising downstream deployment decisions. Although various automated approaches to validating LaaJs have been proposed, alignment with human judgment remains a widely used and conceptually grounded validation strategy. In many real-world domains, the availability of human-labeled evaluation data is severely limited, making it difficult to assess…
Taxonomy
Topics: Scientific Computing and Data Management · Natural Language Processing Techniques · Model-Driven Software Engineering Techniques
