Vintage Code, Modern Judges: Meta-Validation in Low Data Regimes

Ora Nova Fandina; Gal Amram; Eitan Farchi; Shmulik Froimovich; Raviv Gal; Wesam Ibraheem; Rami Katan; Alice Podolsky; and Orna Raz

arXiv:2510.27244 · cs.SE · November 3, 2025

Open Access

TL;DR

This paper introduces SparseAlign, a framework for validating Large Language Model judges against limited human evaluation data, with the goal of reliable model assessment in resource-scarce domains such as legacy code modernization.

Contribution

SparseAlign combines pairwise-confidence and score-sensitive metrics to evaluate how well an LLM-as-a-Judge (LaaJ) aligns with sparse human labels, with the aim of making validation more reliable when labeled data is scarce.

Findings

01. SparseAlign effectively identifies well-aligned LaaJs with minimal human data.

02. Application to COBOL code explanation demonstrates practical utility.

03. Guided model release decisions based on validated LaaJs improve assessment accuracy.

Abstract

Application modernization in legacy languages such as COBOL, PL/I, and REXX faces an acute shortage of resources, both in expert availability and in high-quality human evaluation data. While Large Language Models as a Judge (LaaJ) offer a scalable alternative to expert review, their reliability must be validated before being trusted in high-stakes workflows. Without principled validation, organizations risk a circular evaluation loop, where unverified LaaJs are used to assess model outputs, potentially reinforcing unreliable judgments and compromising downstream deployment decisions. Although various automated approaches to validating LaaJs have been proposed, alignment with human judgment remains a widely used and conceptually grounded validation strategy. In many real-world domains, the availability of human-labeled evaluation data is severely limited, making it difficult to assess…

Taxonomy

Topics: Scientific Computing and Data Management · Natural Language Processing Techniques · Model-Driven Software Engineering Techniques