Distribution-Calibrated Inference time compute for Thinking LLM-as-a-Judge

Hamid Dadkhahi; Firas Trabelsi; Parker Riley; Juraj Juraska; Mehdi Mirzazadeh

arXiv:2512.03019 · cs.LG · December 3, 2025

Open Access · 4 Reviews

TL;DR

This paper introduces a distribution-calibrated aggregation method for large language models acting as judges, improving the reliability of pairwise preference evaluations by modeling preferences with a Bradley-Terry-Davidson approach and leveraging multiple samples.

Contribution

It proposes a novel distribution-calibrated aggregation scheme for inference-time compute, enhancing the accuracy and reliability of LLM-based judgments in preference evaluation.

Findings

01 · Reduces mean absolute error in preference ratings
02 · Increases pairwise accuracy over standard baselines
03 · Matches or exceeds human raters in consensus evaluation

Abstract

Thinking Large Language Models (LLMs) used as judges for pairwise preferences remain noisy at the single-sample level, and common aggregation rules (majority vote, soft self-consistency, or instruction-based self-aggregation) are inconsistent when ties are allowed. We study inference-time compute (ITC) for evaluators that generate n independent thinking-rating samples per item, and propose a principled, distribution-calibrated aggregation scheme. Our method models three-way preferences with a Bradley-Terry-Davidson formulation on rating counts, leveraging both polarity (margin among non-ties) and decisiveness (non-tie rate) to distinguish narrow margins from strong consensus. Across various evaluation benchmarks, our approach consistently reduces MAE and increases pairwise accuracy versus standard baselines, and when evaluated against human-consensus meta-labels, matches or exceeds…
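As a rough illustration of the quantities the abstract names, the sketch below computes polarity and decisiveness from n sampled ratings for one item and maps the counts to three-way probabilities under the standard Davidson tie extension of Bradley-Terry. It is not the authors' procedure: the function names, the label strings `"A"`, `"B"`, `"tie"`, the smoothing constant `alpha`, and the per-item closed-form fit are all assumptions made for the example. The reviews indicate that the paper fits its parameters on a calibration set.

```python
import math
from collections import Counter


def vote_features(votes):
    """Summarise sampled ratings ("A", "B", or "tie") for a single item.

    polarity     = margin among non-ties, in [-1, 1]
    decisiveness = non-tie rate, in [0, 1]
    """
    c = Counter(votes)
    n_a, n_b, n_t = c["A"], c["B"], c["tie"]
    n = n_a + n_b + n_t
    non_ties = n_a + n_b
    polarity = (n_a - n_b) / non_ties if non_ties else 0.0
    decisiveness = non_ties / n if n else 0.0
    return n_a, n_b, n_t, polarity, decisiveness


def davidson_probs(d, nu):
    """Davidson three-way probabilities.

    d  : log-strength gap, log(pi_A) - log(pi_B)
    nu : tie parameter (>= 0); nu = 0 recovers plain Bradley-Terry
    """
    a = math.exp(d / 2.0)
    b = math.exp(-d / 2.0)
    z = a + b + nu
    return {"A": a / z, "tie": nu / z, "B": b / z}


def fit_single_item(n_a, n_b, n_t, alpha=0.5):
    """Closed-form per-item fit on smoothed counts.

    ASSUMPTION: additive smoothing with a hypothetical constant `alpha`;
    with two free parameters and three outcomes, the fit reproduces the
    smoothed empirical proportions exactly.
    """
    a, b, t = n_a + alpha, n_b + alpha, n_t + alpha
    d = math.log(a / b)
    nu = t / math.sqrt(a * b)
    return d, nu


# Example: 8 samples for one item (5 prefer A, 1 prefers B, 2 ties).
votes = ["A"] * 5 + ["B"] + ["tie"] * 2
n_a, n_b, n_t, polarity, decisiveness = vote_features(votes)
d, nu = fit_single_item(n_a, n_b, n_t)
probs = davidson_probs(d, nu)
```

Two items with the same majority label can differ sharply in `polarity` and `decisiveness`, which is the distinction between narrow margins and strong consensus that a plain majority vote discards.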

Peer Reviews

Decision · ICLR 2026 Conference Desk Rejected Submission

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The paper identifies and addresses two understudied gaps in LLM-as-a-Judge research: tie instability across models and prompts, and the loss of evidential strength in simple aggregation. The BTD-based distribution-calibrated scheme is a creative adaptation of statistical preference modeling to LLM evaluation, filling a critical methodological gap.
2. Experiments are rigorous and comprehensive. Testing across 3 LLMs (gemini-2.5-flash, qwen3-next-80b, gpt-oss-120b) and 8 benchmarks ensures generali

Weaknesses

1. More related work should be discussed, e.g. https://aclanthology.org/2024.findings-emnlp.135.pdf, https://arxiv.org/abs/2401.02009, https://arxiv.org/abs/2308.00436. For example, at the same cost, does the proposed method perform better than mirror-consistency, self-contrast, and self-check?
2. Generating samples and fitting the BTD model adds computational cost. The paper does not quantify this overhead (e.g., inference time, token usage) relative to baselines e.g., Self-Consistency with n=1

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- The paper is well presented, though it seems very obvious that the authors are trying very hard to stretch their content to 9 pages.
- It is quite novel to reframe LLM judgement aggregation using a Bradley-Terry model.
- The experiments show improvements over baselines.

Weaknesses

- In the motivation section, the authors appear to make a very strong claim that 'Ties are important to reduce LLM biases.' Though allowing ties reduces the score numerically, it doesn't really solve the fundamental problem of the model's computation mechanism, which is directly related to position bias. As noted by Wang et al. [1], the core problem is the mechanism of position bias. Because of this, the claim that 'Ties are important to reduce LLM biases' is way too strong. Instead of reducing it, it seems more l

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. Clear research motivation addressing the importance of tie decisions in reducing LLM biases and their instability for LLM-as-a-Judge.
2. The method is theoretically rigorous and easy to understand.
3. Sufficient and broad experimental validation across MT and Reward Bench tasks.

Weaknesses

1. Unlike the baselines, the paper uses an additional calibration set (5%-10% of test data). This implies additional human effort when applying the method to different tasks. Is it possible to fit BTD on one task and test on another? Or will the BTD parameters be very sensitive to the selected task?
2. Following from point 1, a sensitivity analysis should be presented for the case where the proposed method is not fitted with the best parameters.
3. Is thinking important? The proposed method is app

Reviewer 04 · Rating 8 · Confidence 4

Strengths

- The paper is well-motivated and effectively presented. It clearly articulates the problem with existing aggregation methods, particularly their failure to handle ties and distributional information gracefully. The manuscript, while concise, is complete and easy to follow.
- The proposed method is well-grounded in established statistical principles, using a Bradley-Terry-Davidson formulation to model three-way preferences. This provides a strong theoretical foundation for the approach.
- The

Weaknesses

See questions.

Taxonomy

Topics: Topic Modeling · Mobile Crowdsensing and Crowdsourcing · Explainable Artificial Intelligence (XAI)