Noise-Response Calibration: A Causal Intervention Protocol for LLM-Judges
Maxim Khomiakov, Jes Frellsen

TL;DR
This paper introduces a calibration protocol for LLM-based judges that uses controlled noise interventions to assess and improve their reliability, revealing a modality gap between text and tabular inputs.
Contribution
It proposes a novel noise-response calibration protocol for LLM judges, built on a slope-based hypothesis test over repeated trials, to address their stochasticity and overconfidence.
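A slope-based test of this kind can be sketched as follows. This is a minimal illustration, not the authors' implementation: it fits an ordinary least-squares slope of score against noise severity and uses a one-sided permutation test as a stand-in for whatever test statistic the paper actually uses. The function names, the severity grid, the significance level, and the synthetic scores are all assumptions made for the example.

```python
import random
from statistics import mean


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def slope_degradation_test(severity, score, n_perm=2000, alpha=0.05, seed=0):
    """One-sided permutation test of H0: slope >= 0 vs H1: slope < 0.

    severity: noise level of each trial (hypothetical grid).
    score:    task metric (e.g. accuracy) observed in each trial.
    Returns (observed_slope, p_value, reject_h0).
    """
    rng = random.Random(seed)
    observed = ols_slope(severity, score)
    shuffled = list(score)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any severity/score association
        if ols_slope(severity, shuffled) <= observed:
            hits += 1
    p_value = (hits + 1) / (n_perm + 1)  # add-one correction
    return observed, p_value, p_value < alpha


# Synthetic illustration only: 5 severity levels, 5 repeated trials each.
levels = [lvl for lvl in range(5) for _ in range(5)]
degrading = [0.9 - 0.1 * lvl for lvl in levels]  # score falls with noise
flat = [0.8 for _ in levels]                     # score ignores noise

print(slope_degradation_test(levels, degrading))
print(slope_degradation_test(levels, flat))
```

Under this sketch, rejecting the null for a dataset means the judge's score falls significantly as noise grows, which is the behavior the protocol expects from a judge that actually uses its input. Failing to reject flags the insensitive case.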
Findings
Text-based judges degrade predictably under noise.
Judges on tabular data often show no performance deterioration under noise.
Model performance is lower on datasets insensitive to noise interventions.
Abstract
Large language models (LLMs) are increasingly used as automated judges and synthetic labelers, especially in low-label settings. Yet these systems are stochastic and often overconfident, which makes deployment decisions difficult when external ground truth is limited. We propose a practical calibration protocol based on controlled input interventions: if noise severity increases, task performance should exhibit a statistically significant deterioration trend. We operationalize this with a slope-based hypothesis test over repeated trials, using signal-to-noise-ratio (SNR) perturbations for tabular data and lexical perturbations for text data. Across UCI tabular benchmarks and four text classification datasets, we find clear modality-dependent behavior. Our results reveal a modality gap: while text-based judges degrade predictably, the majority of tabular datasets show a lack of…
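The two intervention types named in the abstract can be illustrated with a small sketch. The abstract does not give the exact definitions, so everything here is an assumption: SNR is taken in decibels relative to the column's standard deviation with additive Gaussian noise, and the lexical perturbation is an adjacent-letter swap applied at a given rate. The helper names `add_noise_at_snr` and `perturb_text` are hypothetical.

```python
import random
from statistics import pstdev


def add_noise_at_snr(column, snr_db, rng):
    """Add zero-mean Gaussian noise to a numeric column at a target SNR (dB).

    Assumed convention: SNR_dB = 20 * log10(signal_std / noise_std),
    so a lower snr_db means a more severe perturbation.
    """
    signal_std = pstdev(column)
    noise_std = signal_std / (10 ** (snr_db / 20))
    return [v + rng.gauss(0.0, noise_std) for v in column]


def perturb_text(text, rate, rng):
    """Swap adjacent letters with probability `rate` (a simple typo model)."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so it is not swapped back
        else:
            i += 1
    return "".join(chars)


rng = random.Random(0)
feature = [1.0, 2.0, 3.0, 4.0, 5.0]
for snr_db in (40, 20, 0):  # decreasing SNR = increasing severity
    print(snr_db, add_noise_at_snr(feature, snr_db, rng))
for rate in (0.0, 0.2, 0.5):  # increasing severity
    print(rate, perturb_text("the judge reads this sentence", rate, rng))
```

Sweeping a severity grid like this and recording the judge's score at each level over repeated trials would produce the severity and score pairs that a slope-based test takes as input.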
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Machine Learning and Data Classification
