Noise-Response Calibration: A Causal Intervention Protocol for LLM-Judges

Maxim Khomiakov; Jes Frellsen

arXiv:2603.17172 · cs.LG · March 19, 2026

TL;DR

This paper introduces a calibration protocol for LLM-based judges that uses controlled noise interventions to assess their reliability, revealing a modality gap between text and tabular inputs.

Contribution

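The paper proposes a noise-response calibration protocol for LLM judges: a slope-based hypothesis test over repeated trials checks whether task performance deteriorates significantly as noise severity increases. The protocol targets settings where judges are stochastic and often overconfident and external ground truth is limited.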
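A minimal sketch of what a slope-based test over repeated trials could look like is given here. It is not the authors' implementation: it swaps in a one-sided permutation test on the ordinary least-squares slope, and the function names, severity levels, and toy scores are all hypothetical illustrations.

```python
import random
from statistics import mean


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def slope_permutation_test(severity, score, n_perm=2000, seed=0):
    """One-sided permutation test of H0: slope >= 0 vs H1: slope < 0.

    Assumption: a permutation test stands in for whatever slope test
    the paper actually uses. Returns (observed_slope, p_value).
    """
    rng = random.Random(seed)
    observed = ols_slope(severity, score)
    shuffled = list(score)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between severity and score
        if ols_slope(severity, shuffled) <= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical toy data: 4 noise levels, 3 repeated trials per level.
severity = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
degrading = [0.90, 0.88, 0.91, 0.80, 0.82, 0.79,
             0.70, 0.69, 0.72, 0.60, 0.58, 0.61]
flat = [0.71, 0.69, 0.70, 0.70, 0.72, 0.68,
        0.69, 0.71, 0.70, 0.70, 0.69, 0.71]

slope_d, p_d = slope_permutation_test(severity, degrading)
slope_f, p_f = slope_permutation_test(severity, flat)
```

Under this sketch, a judge whose scores fall with noise severity yields a negative slope and a small p-value, while a judge whose scores stay flat does not reject the null; the latter is the kind of case the protocol is meant to flag.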

Findings

01

Text-based judges degrade predictably under noise.

02

Tabular data judges often do not show performance deterioration under noise.

03

Model performance is lower on datasets insensitive to noise interventions.

Abstract

Large language models (LLMs) are increasingly used as automated judges and synthetic labelers, especially in low-label settings. Yet these systems are stochastic and often overconfident, which makes deployment decisions difficult when external ground truth is limited. We propose a practical calibration protocol based on controlled input interventions: if noise severity increases, task performance should exhibit a statistically significant deterioration trend. We operationalize this with a slope-based hypothesis test over repeated trials, using signal-to-noise-ratio (SNR) perturbations for tabular data and lexical perturbations for text data. Across UCI tabular benchmarks and four text classification datasets, we find clear modality-dependent behavior. Our results reveal a modality gap: while text-based judges degrade predictably, the majority of tabular datasets show a lack of…
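The SNR perturbation for tabular data mentioned in the abstract can be illustrated with a short sketch. It assumes one common definition, SNR in decibels as ten times the base-10 logarithm of signal variance over noise variance, with additive Gaussian noise on a single numeric column; the paper's exact definition is not given in this excerpt, and the function name and sample values are hypothetical.

```python
import math
import random
from statistics import pvariance


def add_noise_at_snr(values, snr_db, rng):
    """Add zero-mean Gaussian noise to a numeric column at a target SNR.

    Assumption: SNR_dB = 10 * log10(var(signal) / var(noise)).
    Lower snr_db means stronger noise.
    """
    signal_var = pvariance(values)
    noise_var = signal_var / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_var)
    return [v + rng.gauss(0.0, sigma) for v in values]


def empirical_snr_db(clean, noisy):
    """Measure the realized SNR (in dB) of a perturbed column."""
    noise = [n - c for c, n in zip(clean, noisy)]
    return 10 * math.log10(pvariance(clean) / pvariance(noise))


# Hypothetical column: 5000 draws from a Gaussian feature.
rng = random.Random(0)
column = [rng.gauss(5.0, 2.0) for _ in range(5000)]

noisy_mild = add_noise_at_snr(column, snr_db=10, rng=rng)    # mild noise
noisy_severe = add_noise_at_snr(column, snr_db=-5, rng=rng)  # severe noise

snr_mild = empirical_snr_db(column, noisy_mild)
snr_severe = empirical_snr_db(column, noisy_severe)
```

Sweeping `snr_db` from high to low values would produce the increasing noise severities against which a judge's task performance is measured.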

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Machine Learning and Data Classification