Two Ways to De-Bias an LLM-as-a-Judge: A Continuous-Score Comparison of Hierarchical Bayesian Calibration and Neural-ODE Score Transport

Andrea Morandi

arXiv:2605.09227 · cs.CL · May 12, 2026

TL;DR

This paper compares hierarchical Bayesian linear correction and Neural-ODE score transport for de-biasing LLM-based judges. It shows that which method works better depends on how many paired anchors are available, and it gives deployment guidance on that basis.

Contribution

It introduces and empirically evaluates two de-biasing methods for LLM judges, offering a data-driven decision rule for choosing between them.

Findings

01. Both methods reduce bias to within ±0.08 points with 100 anchors.

02. Neural-ODE flow outperforms linear correction at 1500 anchors across metrics.

03. Linear correction saturates below 1500 anchors, while flow improves with more data.

Abstract

[Abridged] Using a Large Language Model (LLM) as an automatic rater (LLM-as-a-judge) is cheap but potentially biased: some judges run lenient, others strict, the middle of the scale gets compressed, and verbose answers may be over-rewarded. A common remedy is post-hoc calibration: leave the cheap judge in place and, on a modest set of paired anchors, fit a transformation from raw judge scores to an estimate of the human rating. We compare two correctors that take opposing views on how this mapping should be modeled: a parametric, small-anchor hierarchical Bayesian linear correction with per-score uncertainty, and a non-parametric Neural-ODE (FFJORD) score-transport flow. Both are run head-to-head on UltraFeedback fine-grained_score (1700 paired examples, 200 held out), with calibration split into three operational sub-questions: population-mean recovery, per-item accuracy, and…
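The post-hoc calibration idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's hierarchical Bayesian model or its FFJORD flow: it is a plain least-squares affine map fitted on synthetic data, and the 1 to 10 score range, the leniency offset, the compression factor, the noise level, and all function names are assumptions made for illustration only. It fits the map on 100 paired anchors and checks population-mean bias on held-out items.

```python
import random
from statistics import fmean


def fit_affine(judge, human):
    """Least-squares fit of human ≈ a * judge + b on paired anchors."""
    mj, mh = fmean(judge), fmean(human)
    var = sum((x - mj) ** 2 for x in judge)
    cov = sum((x - mj) * (y - mh) for x, y in zip(judge, human))
    a = cov / var
    b = mh - a * mj
    return a, b


def apply_affine(a, b, scores):
    """Map raw judge scores to calibrated estimates of the human rating."""
    return [a * s + b for s in scores]


def mean_bias(pred, human):
    """Signed population-mean error: positive means the scores run lenient."""
    return fmean(p - h for p, h in zip(pred, human))


# Synthetic data (assumption): human ratings uniform on a 1-10 scale.
rng = random.Random(0)
human = [rng.uniform(1.0, 10.0) for _ in range(300)]

# Hypothetical judge: compresses the scale around its midpoint (slope 0.6),
# runs lenient by one point, and adds Gaussian noise.
judge = [5.5 + 0.6 * (h - 5.5) + 1.0 + rng.gauss(0.0, 0.5) for h in human]

# 100 paired anchors for fitting, the remaining 200 held out.
anchor_j, anchor_h = judge[:100], human[:100]
test_j, test_h = judge[100:], human[100:]

a, b = fit_affine(anchor_j, anchor_h)
raw_bias = mean_bias(test_j, test_h)
cal_bias = mean_bias(apply_affine(a, b, test_j), test_h)

print(f"slope={a:.3f} intercept={b:.3f}")
print(f"held-out mean bias: raw={raw_bias:+.3f} calibrated={cal_bias:+.3f}")
```

A single affine map can undo a global offset and a uniform compression, but it cannot represent a distortion that varies along the scale. That limitation is the usual motivation for a more flexible corrector when enough anchors are available.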
