Time To Impeach LLM-as-a-Judge: Programs are the Future of Evaluation
Tzu-Heng Huang, Harit Vishwakarma, Frederic Sala

TL;DR
PAJAMA introduces a cost-effective, interpretable, and less biased evaluation method for LLM responses by synthesizing executable judging programs, outperforming traditional LLM-based judges on benchmark datasets.
Contribution
The paper presents PAJAMA, a novel approach that replaces direct LLM scoring with synthesized executable judging programs, enhancing interpretability, reducing bias, and lowering evaluation costs.
Findings
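Program-based judges improve judgment consistency by 15.83% on average compared to a Qwen2.5-14B-based LLM-as-a-judge.
Biased responses are reduced by 23.7% on average against the same baseline.
When program judgments are distilled into a model, PAJAMA outperforms LLM-as-a-judge on RewardBench.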
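The summary does not say how judgment consistency is measured. As an illustration only, the sketch below assumes one common definition: a judge is consistent on a pair if swapping the order of the two responses flips its verdict accordingly. The names `swap_consistency`, `longer_wins`, and `first_wins` are hypothetical and do not come from the paper.

```python
from typing import Callable, Iterable, Tuple

# A judge takes (query, first_response, second_response) and returns "A", "B", or "tie".
Judge = Callable[[str, str, str], str]

# Verdict expected after the two responses trade places.
MIRROR = {"A": "B", "B": "A", "tie": "tie"}


def swap_consistency(judge: Judge, examples: Iterable[Tuple[str, str, str]]) -> float:
    """Fraction of examples whose verdict mirrors correctly under a position swap.

    This is an assumed metric for illustration, not the paper's definition.
    """
    examples = list(examples)
    if not examples:
        raise ValueError("need at least one example")
    agree = 0
    for query, x, y in examples:
        original = judge(query, x, y)
        swapped = judge(query, y, x)
        if swapped == MIRROR[original]:
            agree += 1
    return agree / len(examples)


def longer_wins(query: str, first: str, second: str) -> str:
    """Toy deterministic judge: prefers the longer response (hypothetical)."""
    if len(first) > len(second):
        return "A"
    if len(second) > len(first):
        return "B"
    return "tie"


def first_wins(query: str, first: str, second: str) -> str:
    """Toy position-biased judge: always prefers whichever response comes first."""
    return "A"


examples = [
    ("q1", "short", "a much longer answer"),
    ("q2", "a detailed reply", "ok"),
]
consistent_rate = swap_consistency(longer_wins, examples)
biased_rate = swap_consistency(first_wins, examples)
```

A judge that scores each response independently of its position, as `longer_wins` does, is consistent on every pair under this definition, while the position-biased `first_wins` is consistent on none.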
Abstract
Large language models (LLMs) are widely used to evaluate the quality of LLM generations and responses, but this leads to significant challenges: high API costs, uncertain reliability, inflexible pipelines, and inherent biases. To address these, we introduce PAJAMA (Program-As-a-Judge for Automated Model Assessment), a new alternative that uses LLMs to synthesize executable judging programs instead of directly scoring responses. These synthesized programs can be stored and run locally, costing orders of magnitude less while providing interpretable and auditable judging logic that can be easily adapted. Program-based judges mitigate biases, improving judgment consistency by 15.83% and reducing biased responses by 23.7% on average compared to a Qwen2.5-14B-based LLM-as-a-judge. When program judgments are distilled into a model, PAJAMA outperforms LLM-as-a-judge on the challenging…
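To make the idea of an executable judging program concrete, here is a minimal sketch of what one might look like. It is not a program from the paper: the criteria (`relevance`, `non_repetition`), their weights, and the `judge_pair` function are all hypothetical stand-ins for logic that an LLM would synthesize. The point is only that the judging logic is plain code that runs locally without API calls and can be read and edited.

```python
import re
from typing import Callable, List, Tuple

# A criterion maps (query, response) to a score in [0, 1].
Criterion = Callable[[str, str], float]


def _tokens(text: str) -> List[str]:
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


def relevance(query: str, response: str) -> float:
    """Hypothetical criterion: share of query words that appear in the response."""
    query_words = set(_tokens(query))
    if not query_words:
        return 0.0
    return len(query_words & set(_tokens(response))) / len(query_words)


def non_repetition(query: str, response: str) -> float:
    """Hypothetical criterion: ratio of distinct words to total words."""
    words = _tokens(response)
    if not words:
        return 0.0
    return len(set(words)) / len(words)


# Weighted criteria; the weights are illustrative assumptions, not tuned values.
CRITERIA: List[Tuple[Criterion, float]] = [(relevance, 0.7), (non_repetition, 0.3)]


def judge_pair(query: str, response_a: str, response_b: str) -> str:
    """Return "A", "B", or "tie" by comparing weighted criterion scores."""
    score_a = sum(weight * fn(query, response_a) for fn, weight in CRITERIA)
    score_b = sum(weight * fn(query, response_b) for fn, weight in CRITERIA)
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"


query = "How do I reverse a list in Python?"
good = "Use list.reverse() or slicing to reverse a list in Python."
bad = "Bananas bananas bananas bananas."
verdict = judge_pair(query, good, bad)
```

Because each response is scored independently of its position, swapping the two responses simply flips the verdict, and every step of the decision can be traced to a named criterion and weight.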
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Software Engineering Research
