Does Alignment Tuning Really Break LLMs' Internal Confidence?
Hongseok Oh, Wonseok Hwang

TL;DR
This paper investigates how alignment tuning affects the calibration of LLMs. It finds that alignment tuning generally degrades calibration, that is, how well a model's confidence matches its accuracy, and argues for methods that preserve both alignment and calibration.
Contribution
It provides a comprehensive analysis of calibration degradation due to alignment tuning across multiple dimensions and highlights the importance of careful confidence measurement.
Findings
Alignment tuning often degrades LLM calibration.
Alignment and calibration do not always trade off, but under stricter analysis conditions alignment tuning consistently harms calibration.
Future algorithms should aim to improve both calibration and instruction-following.
Abstract
Large Language Models (LLMs) have shown remarkable progress, but their real-world application necessitates reliable calibration. This study conducts a comprehensive analysis of calibration degradation of LLMs across four dimensions: models, calibration metrics, tasks, and confidence extraction methods. Initial analysis showed that the relationship between alignment and calibration is not always a trade-off, but under stricter analysis conditions, we found the alignment process consistently harms calibration. This highlights the need for (1) a careful approach when measuring model confidences and calibration errors and (2) future research into algorithms that can help LLMs to achieve both instruction-following and calibration without sacrificing either.
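For readers unfamiliar with calibration metrics, the following is a minimal sketch of Expected Calibration Error (ECE), one commonly used calibration error. The abstract does not name the metrics the authors used, so ECE, the equal-width binning, the function name `expected_calibration_error`, and the `n_bins` parameter are all assumptions made for illustration, not the paper's method.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Illustrative ECE with equal-width bins (an assumption, not the paper's setup).

    confidences: predicted probabilities in [0, 1], one per answer.
    correct: 1/True if the corresponding answer was right, else 0/False.
    Returns the sum over bins of (bin weight) * |accuracy - mean confidence|.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if len(confidences) == 0:
        raise ValueError("need at least one prediction")

    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Map a confidence to its bin index; a confidence of 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, float(hit)))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece
```

A model that reports 0.9 confidence on four answers but gets only three right is overconfident, and this sketch assigns it a nonzero error. How the confidence values are extracted from the model in the first place (the abstract's "confidence extraction methods") is outside this sketch and can change the result.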
Taxonomy
Topics: Digital Rights Management and Security
