Calibration Collapse Under Sycophancy Fine-Tuning: How Reward Hacking Breaks Uncertainty Quantification in LLMs
Subramanyam Sahoo

TL;DR
This paper investigates how reward hacking through sycophantic fine-tuning degrades the calibration of large language models, and with it the reliability of their uncertainty estimates, and proposes a methodology for evaluating the effect.
Contribution
It introduces a methodology for evaluating calibration degradation caused by reward hacking and shows that residual miscalibration remains even after post-hoc correction.
Findings
Sycophantic reward signals degrade calibration, increasing both expected calibration error (ECE) and maximum calibration error (MCE).
Post-hoc matrix scaling reduces calibration error significantly across models.
Reward-induced miscalibration persists after the affine (matrix scaling) correction, indicating structured residuals that an affine map of the logits cannot remove.
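For readers unfamiliar with the two metrics named in the findings, the following is a minimal sketch of how ECE and MCE are commonly computed from per-item confidences and correctness flags. The function name `calibration_errors`, the choice of 10 equal-width bins, and the half-open binning convention are assumptions made for illustration; the summary does not state the paper's exact binning scheme.

```python
from typing import Sequence, Tuple


def calibration_errors(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumed bin count, not taken from the paper
) -> Tuple[float, float]:
    """Return (ECE, MCE) using equal-width confidence bins on [0, 1]."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    conf_sum = [0.0] * n_bins  # sum of confidences per bin
    hit_sum = [0] * n_bins     # number of correct answers per bin
    count = [0] * n_bins       # number of predictions per bin

    for c, ok in zip(confidences, correct):
        # Bin b covers [b / n_bins, (b + 1) / n_bins); c == 1.0 goes in the last bin.
        b = min(int(c * n_bins), n_bins - 1)
        conf_sum[b] += c
        hit_sum[b] += int(ok)
        count[b] += 1

    ece = 0.0
    mce = 0.0
    for s, h, k in zip(conf_sum, hit_sum, count):
        if k == 0:
            continue  # empty bins contribute nothing
        gap = abs(h / k - s / k)  # |accuracy - mean confidence| in this bin
        ece += (k / n) * gap      # ECE weights each gap by the bin's share of items
        mce = max(mce, gap)       # MCE is the worst gap over non-empty bins
    return ece, mce


if __name__ == "__main__":
    ece, mce = calibration_errors([0.95, 0.95, 0.55, 0.55], [True, True, False, False])
    print(f"ECE={ece:.3f} MCE={mce:.3f}")
```

ECE averages the per-bin gap weighted by bin size, while MCE reports only the worst bin, so a model can show a modest ECE alongside a large MCE when a few confidence ranges are badly off.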
Abstract
Modern large language models (LLMs) are increasingly fine-tuned via reinforcement learning from human feedback (RLHF) or related reward optimisation schemes. While such procedures improve perceived helpfulness, we investigate whether sycophantic reward signals degrade calibration, a property essential for reliable uncertainty quantification. We fine-tune Qwen3-8B under three regimes: no fine-tuning (base), neutral supervised fine-tuning (SFT) on TriviaQA, and sycophancy-inducing Group Relative Policy Optimisation (GRPO) that rewards agreement with planted wrong answers. Evaluating on MMLU items across five subject domains with bootstrap confidence intervals and permutation testing, we find that sycophantic GRPO produces consistent directional calibration degradation: ECE rises relative to the base model and MCE increases relative to neutral…
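The abstract names bootstrap confidence intervals and permutation testing as the statistical tools used to compare regimes. The sketch below shows one plausible way to apply both to ECE; it is not the paper's procedure. The helper names (`ece`, `bootstrap_ci`, `permutation_p_value`), the percentile bootstrap, the unpaired label-shuffling test, and the default resample counts are all assumptions. Since the regimes are evaluated on the same MMLU items, a paired variant may be closer to what the authors did.

```python
import random
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, bool]  # (confidence, answer was correct)


def ece(samples: Sequence[Sample], n_bins: int = 10) -> float:
    """Expected calibration error with equal-width bins (assumed scheme)."""
    conf_sum = [0.0] * n_bins
    hit_sum = [0] * n_bins
    count = [0] * n_bins
    for c, ok in samples:
        b = min(int(c * n_bins), n_bins - 1)
        conf_sum[b] += c
        hit_sum[b] += int(ok)
        count[b] += 1
    n = len(samples)
    return sum(abs(h - s) / n for s, h, k in zip(conf_sum, hit_sum, count) if k)


def bootstrap_ci(
    samples: Sequence[Sample],
    stat: Callable[[Sequence[Sample]], float],
    n_boot: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval: resample items with replacement."""
    rng = random.Random(seed)
    n = len(samples)
    stats = sorted(
        stat([samples[rng.randrange(n)] for _ in range(n)]) for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


def permutation_p_value(
    a: Sequence[Sample],
    b: Sequence[Sample],
    stat: Callable[[Sequence[Sample]], float],
    n_perm: int = 1000,
    seed: int = 0,
) -> float:
    """Two-sided unpaired permutation test on |stat(a) - stat(b)|."""
    rng = random.Random(seed)
    observed = abs(stat(a) - stat(b))
    pooled: List[Sample] = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign items to the two groups at random
        diff = abs(stat(pooled[: len(a)]) - stat(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0
```

Note that per-bin gaps in `ece` are written as `abs(h - s) / n`, which equals the usual size-weighted gap `(k / n) * abs(h / k - s / k)`.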
