Quantitative LLM Judges

Aishwarya Sahoo; Jeevana Kruthi Karnuthala; Tushar Parmanand Budhwani; Pranchal Agarwal; Sankaran Vaidyanathan; Alexa Siu; Franck Dernoncourt; Jennifer Healey; Nedim Lipka; Ryan Rossi; Uttaran Bhattacharya; and Branislav Kveton

arXiv:2506.02945·cs.CL·October 24, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces quantitative LLM judges, which use regression models to align the scores of existing LLM judges with human preferences. The approach improves predictive accuracy at low computational cost for both absolute and relative feedback.

Contribution

The paper proposes a framework for quantitative LLM judges: a regression model trained on a base judge's rationale and score aligns that judge's scores with human ratings, without requiring extensive fine-tuning.

Findings

1. Quantitative judges improve the predictive power of existing LLM judges.
2. The framework is more computationally efficient than supervised fine-tuning.
3. The framework is validated on four datasets with two base judges, showing consistent improvements.

Abstract

LLM-as-a-judge is a framework where a large language model (LLM) evaluates the output of another LLM. While LLMs excel at producing qualitative textual evaluations, they often struggle to predict human preferences and numeric scores. We propose quantitative LLM judges, which align evaluation scores of existing LLM judges to humans in a given domain using regression models. The models are trained to improve the score of the original judge using its rationale and score. We present four quantitative judges for different types of absolute and relative feedback, which showcases the generality and versatility of our framework. Our framework is more computationally efficient than supervised fine-tuning and can be more statistically efficient when human feedback is limited, which is expected in practice. We validate these claims empirically on four datasets using two base judges. Our…
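The core idea in the abstract, a regression model that maps a base judge's rationale and score to a human score, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the function names `fit_ls_judge` and `predict_ls_judge` are hypothetical, the rationale is assumed to be available as a fixed-length embedding vector, the ridge penalty is an arbitrary choice, and the data are synthetic stand-ins.

```python
import numpy as np


def _features(embeddings, judge_scores):
    # Concatenate the rationale embedding, the base judge's score, and a bias column.
    return np.column_stack([embeddings, judge_scores, np.ones(len(judge_scores))])


def fit_ls_judge(embeddings, judge_scores, human_scores, l2=1e-3):
    """Ridge-regularized least squares from (embedding, judge score) to human score.

    Assumption: a linear model with a small L2 penalty (applied to the bias as well,
    for simplicity). The paper's exact parameterization is not given in this summary.
    """
    X = _features(embeddings, judge_scores)
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ human_scores)


def predict_ls_judge(weights, embeddings, judge_scores):
    """Calibrated score for each (rationale embedding, judge score) pair."""
    return _features(embeddings, judge_scores) @ weights


def demo(seed=0):
    """Compare raw judge scores with calibrated scores on synthetic held-out data."""
    rng = np.random.default_rng(seed)
    n, d = 200, 8
    emb = rng.normal(size=(n, d))                     # stand-in for rationale embeddings
    judge = rng.integers(1, 6, size=n).astype(float)  # stand-in for 1-5 judge scores
    # Synthetic "human" scores that depend on both the judge score and the rationale.
    human = 0.5 * judge + emb @ rng.normal(size=d) + 1.0 + 0.1 * rng.normal(size=n)

    train, test = slice(0, 150), slice(150, None)
    w = fit_ls_judge(emb[train], judge[train], human[train])
    calibrated = predict_ls_judge(w, emb[test], judge[test])

    raw_mse = float(np.mean((judge[test] - human[test]) ** 2))
    cal_mse = float(np.mean((calibrated - human[test]) ** 2))
    return raw_mse, cal_mse


if __name__ == "__main__":
    raw_mse, cal_mse = demo()
    print(f"raw judge MSE: {raw_mse:.3f}, calibrated MSE: {cal_mse:.3f}")
```

Only the small regression model is trained here; the base judge is used as a fixed source of rationales and scores, which is why this kind of post hoc calibration is cheap compared with fine-tuning the LLM itself.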

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- The paper is clearly written and well structured, with comprehensive coverage of both model architectures and datasets.
- The proposed approach is conceptually straightforward and easy to follow.

Weaknesses

- The novelty of the proposed method appears limited. Prior research has already explored the use of trainable adapters or probing modules to enhance evaluation accuracy.
- The experimental comparison is relatively weak. The method is compared with only one SFT baseline. To convincingly demonstrate superior performance, the paper should include a broader set of baselines, both training-free calibration methods (e.g., G-Eval) and approaches that incorporate additional training modules, such as li

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- The calibration framework presented is simple, post hoc, and easy to implement.
- It is (pleasantly) surprising that a lightweight model such as ordinary least squares is sufficient to calibrate LLM-judges to human evaluations. In a similar vein, it is nice that this method is post hoc (efficient to optimize a GLM, no fine-tuning required). The proposed calibration framework is simple, elegant, and easy to implement.
- The paper is very well motivated, addressing the important and timely con

Weaknesses

- The proposed quantitative judges are trained and evaluated on datasets labeled by the same human preferences used to assess their alignment. This setup is circular: models are optimized to reproduce human scores and then evaluated on their ability to approximate those same scores. To ensure that the method is not simply overfitting to the human labels, a stronger experiment might evaluate calibration on held-out datasets (can be in a similar domain).
- Dependence on Human Evaluations: The appro

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The proposed framework is highly generalizable, as it relies only on rationales and scores and can therefore be applied to any base LLM judge.
2. The paper effectively leverages the original rationales and scores by training a small model to predict the final score, improving alignment with human evaluations in a computationally and data-efficient manner.
3. The paper tests various types of small models, such as the least-squares (LS) judge, multinomial (MN) judge, Bradley–Terry–Luce (BTL) ju
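For relative (pairwise) feedback, the Bradley–Terry–Luce model mentioned in the list above assigns each response a latent score and models the probability that response A is preferred over response B as a logistic function of the score difference. The sketch below assumes a linear score over hypothetical per-response features (for example, a rationale embedding plus the base judge's score); the function names, the plain gradient-descent fit, and the synthetic data are illustrative choices, not the paper's BTL judge.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_btl_judge(feat_a, feat_b, a_preferred, lr=0.1, steps=500, l2=1e-3):
    """Fit w so that P(A preferred over B) = sigmoid(w . (feat_a - feat_b)).

    Assumption: linear latent scores and batch gradient descent on the
    L2-regularized logistic loss. a_preferred holds 1.0 where humans chose A.
    """
    diff = feat_a - feat_b
    w = np.zeros(diff.shape[1])
    for _ in range(steps):
        p = sigmoid(diff @ w)
        grad = diff.T @ (p - a_preferred) / len(a_preferred) + l2 * w
        w -= lr * grad
    return w


def predict_btl(w, feat_a, feat_b):
    """Probability that A is preferred over B under the fitted model."""
    return sigmoid((feat_a - feat_b) @ w)


def demo_btl(seed=0):
    """Held-out preference accuracy on synthetic pairwise data."""
    rng = np.random.default_rng(seed)
    n, d = 400, 6
    feat_a = rng.normal(size=(n, d))  # stand-in features for response A
    feat_b = rng.normal(size=(n, d))  # stand-in features for response B
    w_true = np.array([2.0, -1.5, 1.0, 0.5, -2.0, 1.5])
    # Sample synthetic human preferences from a BTL model with known weights.
    labels = (rng.random(n) < sigmoid((feat_a - feat_b) @ w_true)).astype(float)

    w = fit_btl_judge(feat_a[:300], feat_b[:300], labels[:300])
    predicted_a = predict_btl(w, feat_a[300:], feat_b[300:]) > 0.5
    return float(np.mean(predicted_a == (labels[300:] > 0.5)))


if __name__ == "__main__":
    print(f"held-out preference accuracy: {demo_btl():.2f}")
```

Because the model depends only on the feature difference, swapping A and B flips the predicted probability to its complement, which is the symmetry one expects from a pairwise preference model.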

Weaknesses

1. It is helpful to include experiments that use a larger and more complex model for the quantitative judges, such as a two-layer MLP.
2. The paper currently lacks ablation studies and analyses investigating the impact of the rationale embedding and the initial score. In particular, it is necessary to include experiments that use only rationale embeddings and those that use only the score.
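The ablation requested in the second point could be set up along the following lines: fit the same linear model on three feature sets (score only, rationale embedding only, both) and compare held-out error. This is a hypothetical sketch; the helper names `heldout_mse` and `ablation`, the unregularized least-squares fit, and the simple train/test split are assumptions, and no results from the paper are implied.

```python
import numpy as np


def _with_bias(X):
    # Append a constant column so the linear model has an intercept.
    return np.column_stack([X, np.ones(len(X))])


def heldout_mse(X_train, y_train, X_test, y_test):
    """Fit ordinary least squares on the training split; return test-split MSE."""
    w, *_ = np.linalg.lstsq(_with_bias(X_train), y_train, rcond=None)
    return float(np.mean((_with_bias(X_test) @ w - y_test) ** 2))


def ablation(embeddings, judge_scores, human_scores, n_train):
    """Held-out MSE for each feature set: score only, embedding only, both."""
    score_col = np.asarray(judge_scores, dtype=float).reshape(-1, 1)
    variants = {
        "score_only": score_col,
        "embedding_only": embeddings,
        "both": np.column_stack([embeddings, score_col]),
    }
    train, test = slice(0, n_train), slice(n_train, None)
    return {
        name: heldout_mse(X[train], human_scores[train], X[test], human_scores[test])
        for name, X in variants.items()
    }
```

Calling `ablation(embeddings, judge_scores, human_scores, n_train=...)` returns a dictionary of three errors; a large gap between "both" and either single-feature variant would indicate that the dropped input carries signal the other does not.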

Taxonomy

Topics: Dispute Resolution and Class Actions · Legal Systems and Judicial Processes · Legal Education and Practice Innovations