The Alignment Tax: Response Homogenization in Aligned LLMs and Its Implications for Uncertainty Estimation
Mingyi Liu

TL;DR
Aligned language models tend to produce homogenized responses, which reduces the effectiveness of sampling-based uncertainty estimation. The effect varies across tasks, model families, and scales, and it can be mitigated through selective prediction.
Contribution
This paper identifies and characterizes response homogenization (the "alignment tax") in aligned LLMs, demonstrates that it is task-dependent, and proposes a cost-effective mitigation.
Findings
On TruthfulQA, 40-79% of questions yield a single semantic cluster across 10 sampled responses.
Sampling-based uncertainty methods lose discriminative power on homogenized responses.
Selective prediction improves GSM8K accuracy from 84.4% to 93.2% at 50% coverage.
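The selective prediction finding can be illustrated with a minimal sketch: rank questions by an uncertainty score (here token entropy, lower meaning more confident), answer only the most confident fraction, and measure accuracy on that subset. The function name `selective_accuracy` and the toy numbers are hypothetical and are not taken from the paper; the paper's exact scoring and thresholding procedure may differ.

```python
def selective_accuracy(entropies, correct, coverage):
    """Accuracy on the `coverage` fraction of items with the lowest entropy.

    entropies: per-question uncertainty scores (assumed: mean token entropy).
    correct:   per-question 0/1 correctness labels.
    coverage:  fraction of questions to answer, in (0, 1].
    """
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    n_keep = max(1, int(round(coverage * len(entropies))))
    # Sort question indices from most confident (lowest entropy) to least.
    order = sorted(range(len(entropies)), key=lambda i: entropies[i])
    kept = order[:n_keep]
    return sum(correct[i] for i in kept) / n_keep


# Toy data (hypothetical): the one wrong answer also has the highest entropy.
toy_entropies = [0.1, 0.9, 0.2, 0.8]
toy_correct = [1, 0, 1, 1]

full_acc = selective_accuracy(toy_entropies, toy_correct, 1.0)
half_acc = selective_accuracy(toy_entropies, toy_correct, 0.5)
print(full_acc, half_acc)
```

Abstaining on the high-entropy half raises accuracy only when entropy actually separates correct from incorrect answers, which is why the discriminative power of the score matters.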
Abstract
RLHF-aligned language models exhibit response homogenization: on TruthfulQA (n=790), 40-79% of questions produce a single semantic cluster across 10 i.i.d. samples. On affected questions, sampling-based uncertainty methods have zero discriminative power (AUROC=0.500), while free token entropy retains signal (0.603). This alignment tax is task-dependent: on GSM8K (n=500), token entropy achieves 0.724 (Cohen's d=0.81). A base-vs-instruct ablation confirms the causal role of alignment: the base model shows 1.0% single-cluster rate vs. 28.5% for the instruct model (p < 10^{-6}). A training stage ablation (Base 0.0% -> SFT 1.5% -> DPO 4.0% SCR) localizes the cause to DPO, not SFT. Cross-family replication on four model families reveals alignment tax severity varies by family and scale. We validate across 22 experiments, 5 benchmarks, 4 model families, and 3 model scales (3B-14B), with…
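A short sketch may help show why sampling-based methods score AUROC=0.500 on homogenized questions. It assumes semantic cluster labels for each question's samples have already been computed by some external clustering step (not shown, and not necessarily the paper's procedure); all function names and the toy data are hypothetical. When every sample falls in one cluster, the entropy over clusters is exactly zero for every such question, so the scores are all tied and cannot rank errors above correct answers.

```python
import math
from collections import Counter


def cluster_entropy(labels):
    """Entropy (nats) of the empirical distribution over semantic clusters."""
    n = len(labels)
    counts = Counter(labels)
    return sum((c / n) * math.log(n / c) for c in counts.values())


def single_cluster_rate(per_question_labels):
    """Fraction of questions whose samples all share one cluster (SCR)."""
    single = sum(len(set(labels)) == 1 for labels in per_question_labels)
    return single / len(per_question_labels)


def auroc(scores, is_error):
    """Probability that an error outscores a correct answer; ties count 0.5."""
    pos = [s for s, e in zip(scores, is_error) if e]
    neg = [s for s, e in zip(scores, is_error) if not e]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 10 samples per question, labels are cluster ids.
questions = [[0] * 10, [0] * 10, [0] * 5 + [1] * 5]
scr = single_cluster_rate(questions)

# Restrict to the homogenized questions: both have entropy 0.0, one is wrong.
homogenized_scores = [cluster_entropy(q) for q in questions[:2]]
tied_auroc = auroc(homogenized_scores, [True, False])
print(scr, homogenized_scores, tied_auroc)
```

Token-level entropy is computed from the model's output distribution rather than from agreement between samples, so it can still vary across questions that collapse to a single cluster.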
