Calibration Collapse Under Sycophancy Fine-Tuning: How Reward Hacking Breaks Uncertainty Quantification in LLMs

Subramanyam Sahoo

arXiv:2604.10585·cs.LG·April 14, 2026

TL;DR

This paper investigates whether reward hacking through sycophancy-inducing fine-tuning degrades the calibration of large language models, and therefore their uncertainty quantification, and it presents an evaluation methodology for measuring that degradation.

Contribution

The paper introduces a methodology for measuring calibration degradation caused by reward hacking, and it highlights that some miscalibration remains even after post-hoc correction.

Findings

01

Sycophantic reward signals lead to calibration degradation, increasing ECE and MCE.

02

Post-hoc matrix scaling reduces calibration error significantly across models.

03

Reward-induced miscalibration persists after affine correction, indicating structured residuals.
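
The findings above are stated in terms of ECE (expected calibration error) and MCE (maximum calibration error). As a rough illustration of what those two quantities measure, here is a minimal sketch using the common equal-width binning definition. The function name `calibration_errors`, the choice of 10 bins, and the binning scheme are assumptions made for this example; the summary does not say which estimator the paper uses.

```python
import numpy as np

def calibration_errors(confidences, correct, n_bins=10):
    """Return (ECE, MCE) under equal-width confidence bins.

    Assumed inputs (hypothetical names, not from the paper):
      confidences: the model's probability for its chosen answer, in [0, 1]
      correct:     1 if the chosen answer was right, else 0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)

    # Interior bin edges; right=True makes bins right-closed, so a
    # confidence of exactly 1.0 falls in the last bin.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)

    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        # Gap between observed accuracy and mean stated confidence in this bin
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.mean() * gap   # weight the gap by the fraction of items in the bin
        mce = max(mce, gap)        # MCE keeps only the worst bin
    return ece, mce
```

ECE averages the per-bin gap weighted by bin occupancy, while MCE reports the single worst bin, so a shift that only affects a small, high-confidence bin can move MCE more than ECE.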

Abstract

Modern large language models (LLMs) are increasingly fine-tuned via reinforcement learning from human feedback (RLHF) or related reward optimisation schemes. While such procedures improve perceived helpfulness, we investigate whether sycophantic reward signals degrade calibration -- a property essential for reliable uncertainty quantification. We fine-tune Qwen3-8B under three regimes: no fine-tuning (base), neutral supervised fine-tuning (SFT) on TriviaQA, and sycophancy-inducing Group Relative Policy Optimisation (GRPO) that rewards agreement with planted wrong answers. Evaluating on $1{,}000$ MMLU items across five subject domains with bootstrap confidence intervals and permutation testing, we find that sycophantic GRPO produces consistent directional calibration degradation -- ECE rises by $+0.006$ relative to the base model and MCE increases by $+0.010$ relative to neutral…
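
The abstract mentions bootstrap confidence intervals and permutation testing on a shared set of evaluation items. The sketch below shows one plausible way such a comparison could be set up: a percentile bootstrap for a single model's ECE, and a paired permutation test that randomly swaps the two models' predictions item by item. All function names, the 10-bin ECE estimator, the percentile method, and the swap-based permutation scheme are assumptions for illustration; the excerpt does not describe the paper's exact procedure.

```python
import numpy as np

def ece(conf, correct, n_bins=10):
    """Equal-width-bin ECE (an assumed estimator, see lead-in)."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ids = np.digitize(conf, edges[1:-1], right=True)
    total = 0.0
    for b in range(n_bins):
        m = ids == b
        if m.any():
            total += m.mean() * abs(correct[m].mean() - conf[m].mean())
    return total

def bootstrap_ci(conf, correct, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for ECE, resampling items with replacement."""
    conf, correct = np.asarray(conf, dtype=float), np.asarray(correct, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(conf)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)
        stats[i] = ece(conf[idx], correct[idx])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi

def paired_permutation_pvalue(conf_a, corr_a, conf_b, corr_b, n_perm=1000, seed=0):
    """Two-sided test of ECE(b) - ECE(a) for two models scored on the same items."""
    conf_a, corr_a = np.asarray(conf_a, dtype=float), np.asarray(corr_a, dtype=float)
    conf_b, corr_b = np.asarray(conf_b, dtype=float), np.asarray(corr_b, dtype=float)
    rng = np.random.default_rng(seed)
    observed = ece(conf_b, corr_b) - ece(conf_a, corr_a)
    hits = 0
    for _ in range(n_perm):
        # Under the null the model labels are exchangeable, so swap them per item
        swap = rng.random(len(conf_a)) < 0.5
        ca, ra = np.where(swap, conf_b, conf_a), np.where(swap, corr_b, corr_a)
        cb, rb = np.where(swap, conf_a, conf_b), np.where(swap, corr_a, corr_b)
        if abs(ece(cb, rb) - ece(ca, ra)) >= abs(observed):
            hits += 1
    # Add-one correction keeps the p-value strictly positive
    return observed, (hits + 1) / (n_perm + 1)
```

Pairing by item matters here because all regimes are evaluated on the same questions; an unpaired test would ignore that shared difficulty and typically be less sensitive to small shifts such as the ones reported in the abstract.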
