Time To Impeach LLM-as-a-Judge: Programs are the Future of Evaluation

Tzu-Heng Huang; Harit Vishwakarma; Frederic Sala

arXiv:2506.10403·cs.LG·June 13, 2025

Open Access

TL;DR

PAJAMA introduces a cost-effective, interpretable, and less biased evaluation method for LLM responses by synthesizing executable judging programs, outperforming traditional LLM-based judges on benchmark datasets.

Contribution

The paper presents PAJAMA, a novel approach that replaces direct LLM scoring with synthesized executable judging programs, enhancing interpretability, reducing bias, and lowering evaluation costs.

Findings

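01

Program-based judges improve judgment consistency by 15.83% on average compared to a Qwen2.5-14B-based LLM-as-a-judge.

02

Biased responses are reduced by 23.7% on average against the same baseline.

03

When program judgments are distilled into a model, PAJAMA outperforms LLM-as-a-judge on RewardBench.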

Abstract

Large language models (LLMs) are widely used to evaluate the quality of LLM generations and responses, but this leads to significant challenges: high API costs, uncertain reliability, inflexible pipelines, and inherent biases. To address these, we introduce PAJAMA (Program-As-a-Judge for Automated Model Assessment), a new alternative that uses LLMs to synthesize executable judging programs instead of directly scoring responses. These synthesized programs can be stored and run locally, costing orders of magnitude less while providing interpretable and auditable judging logic that can be easily adapted. Program-based judges mitigate biases, improving judgment consistency by 15.83% and reducing biased responses by 23.7% on average compared to a Qwen2.5-14B-based LLM-as-a-judge. When program judgments are distilled into a model, PAJAMA outperforms LLM-as-a-judge on the challenging…
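To make the idea of an executable judging program concrete, here is a minimal sketch of what such a program might look like for pairwise comparison. It is not taken from the paper: the function names `judge_program` and `is_order_consistent`, the scoring rules (prompt-term coverage and word diversity), and the weights are all hypothetical, chosen only to show that the judging logic is plain code that can be stored, read, and run locally without an API call.

```python
import re
from typing import Literal

Verdict = Literal["A", "B", "tie"]


def judge_program(prompt: str, response_a: str, response_b: str) -> Verdict:
    """Hypothetical judging program: score both responses with fixed,
    readable rules and prefer the higher score. The rules are illustrative
    assumptions, not the criteria used in the paper."""

    def tokenize(text: str) -> list[str]:
        return re.findall(r"[a-z0-9']+", text.lower())

    def score(response: str) -> float:
        words = tokenize(response)
        if not words:
            return 0.0
        prompt_terms = set(tokenize(prompt))
        # Rule 1 (assumed): fraction of prompt terms the response mentions.
        coverage = len(prompt_terms & set(words)) / max(len(prompt_terms), 1)
        # Rule 2 (assumed): penalize repetitive text via unique-word ratio.
        diversity = len(set(words)) / len(words)
        return 0.7 * coverage + 0.3 * diversity

    score_a, score_b = score(response_a), score(response_b)
    if abs(score_a - score_b) < 1e-9:
        return "tie"
    return "A" if score_a > score_b else "B"


def is_order_consistent(prompt: str, x: str, y: str) -> bool:
    """Check that the verdict picks the same response when the two
    candidates are presented in swapped order."""
    flipped = {"A": "B", "B": "A", "tie": "tie"}
    return judge_program(prompt, x, y) == flipped[judge_program(prompt, y, x)]


if __name__ == "__main__":
    question = "What is the capital of France?"
    good = "The capital of France is Paris."
    bad = "I like turtles turtles turtles."
    print(judge_program(question, good, bad))
    print(is_order_consistent(question, good, bad))
```

Because this sketch scores each response independently of its position, swapping the candidates cannot change which one wins. That is one way a program-based judge can avoid order sensitivity, though the paper's own consistency measurement may be defined differently.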

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Software Engineering Research