Two Ways to De-Bias an LLM-as-a-Judge: A Continuous-Score Comparison of Hierarchical Bayesian Calibration and Neural-ODE Score Transport
Andrea Morandi

TL;DR
This paper compares a hierarchical Bayesian linear correction with a Neural-ODE score-transport flow for de-biasing LLM-based judges, shows that which one works better depends on how many paired anchors are available, and gives deployment guidance.
Contribution
It introduces and empirically evaluates two de-biasing methods for LLM judges, offering a data-driven decision rule for choosing between them.
Findings
Both methods reduce bias to within ±0.08 points with 100 anchors.
Neural-ODE flow outperforms linear correction at 1500 anchors across metrics.
The linear correction saturates before 1500 anchors, while the Neural-ODE flow keeps improving as more anchors are added.
Abstract
[Abridged] Using a Large Language Model (LLM) as an automatic rater (LLM-as-a-judge) is cheap but potentially biased: some judges run lenient, others strict, the middle of the scale gets compressed, and verbose answers may be over-rewarded. A common remedy is post-hoc calibration: leave the cheap judge in place and, on a modest set of paired anchors, fit a transformation from raw judge scores to an estimate of the human rating. We compare two correctors that take opposing views on how this mapping should be modeled: a parametric, small-anchor hierarchical Bayesian linear correction with per-score uncertainty, and a non-parametric Neural-ODE (FFJORD) score-transport flow. Both are run head-to-head on UltraFeedback fine-grained_score (1700 paired examples, 200 held out), with calibration split into three operational sub-questions: population-mean recovery, per-item accuracy, and…
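The post-hoc calibration idea in the abstract can be illustrated with a minimal sketch. It is not the paper's hierarchical Bayesian model or its FFJORD flow: it fits a plain ordinary-least-squares affine map from judge scores to human ratings on a small anchor set, then checks population-mean bias and per-item error on held-out items. The data are synthetic (a lenient judge with a compressed scale), and every name here (`fit_affine`, `mean_bias`, `mae`, the 100/200 split) is a hypothetical choice made for illustration.

```python
import random
import statistics


def fit_affine(judge, human):
    """Ordinary least squares for human ~ a + b * judge."""
    mj = statistics.fmean(judge)
    mh = statistics.fmean(human)
    sxx = sum((j - mj) ** 2 for j in judge)
    sxy = sum((j - mj) * (h - mh) for j, h in zip(judge, human))
    b = sxy / sxx
    a = mh - b * mj
    return a, b


def apply_affine(params, judge):
    """Map raw judge scores to calibrated estimates of the human rating."""
    a, b = params
    return [a + b * j for j in judge]


def mean_bias(pred, human):
    """Population-level error: mean of (prediction - human rating)."""
    return statistics.fmean([p - h for p, h in zip(pred, human)])


def mae(pred, human):
    """Per-item error: mean absolute difference from the human rating."""
    return statistics.fmean([abs(p - h) for p, h in zip(pred, human)])


# Synthetic stand-in data (assumption): human ratings on a 1-5 scale and a
# judge that is lenient (+1.2 offset) with a compressed scale (slope 0.8).
rng = random.Random(0)
human = [rng.uniform(1.0, 5.0) for _ in range(300)]
judge = [1.2 + 0.8 * h + rng.gauss(0.0, 0.3) for h in human]

# 100 paired anchors for fitting, 200 held-out items for evaluation.
anchor_j, anchor_h = judge[:100], human[:100]
test_j, test_h = judge[100:], human[100:]

params = fit_affine(anchor_j, anchor_h)
calibrated = apply_affine(params, test_j)

raw_bias, cal_bias = mean_bias(test_j, test_h), mean_bias(calibrated, test_h)
raw_mae, cal_mae = mae(test_j, test_h), mae(calibrated, test_h)

print(f"mean bias: raw={raw_bias:+.3f}  calibrated={cal_bias:+.3f}")
print(f"MAE:       raw={raw_mae:.3f}  calibrated={cal_mae:.3f}")
```

An affine map can only undo a constant offset and a uniform stretch of the scale. Distortions that vary across the score range are the case a flexible transport map is meant to cover, at the cost of needing more anchors.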
