Navigating the Alignment-Calibration Trade-off: A Pareto-Superior Frontier via Model Merging
Tiancheng Hu, Benjamin Minixhofer, Nigel Collier

TL;DR
This paper introduces a simple post-hoc model merging technique for navigating the alignment-calibration trade-off: interpolating between a model's pre- and post-alignment weights yields models that exceed both parents in accuracy while substantially recovering the calibration lost during alignment, mitigating the alignment tax at low computational cost.
Contribution
It demonstrates that interpolating between the weights of a model before and after alignment consistently reveals Pareto-optimal solutions: merged models that are more accurate than either parent and substantially better calibrated than the aligned model.
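To make "Pareto-optimal" concrete, here is a minimal sketch of selecting non-dominated candidates from an interpolation sweep. It assumes each candidate is scored by an accuracy (higher is better) and a calibration error (lower is better). The `Candidate` type, the function names, and the toy numbers are hypothetical illustrations, not values or code from the paper.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    lam: float          # interpolation coefficient (hypothetical name)
    accuracy: float     # higher is better
    calib_error: float  # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is at least as good as `b` on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.calib_error <= b.calib_error
    strictly_better = a.accuracy > b.accuracy or a.calib_error < b.calib_error
    return no_worse and strictly_better


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


# Toy numbers for illustration only; they are NOT results from the paper.
toy = [
    Candidate(lam=0.0, accuracy=0.60, calib_error=0.05),  # pre-alignment parent
    Candidate(lam=0.5, accuracy=0.66, calib_error=0.08),  # merged model
    Candidate(lam=1.0, accuracy=0.64, calib_error=0.20),  # post-alignment parent
]
front = pareto_front(toy)
```

In this toy sweep the merged model dominates the aligned parent (more accurate and better calibrated), so the aligned parent drops off the front.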
Findings
Interpolating between the pre- and post-alignment models recovers calibration lost during alignment.
Model merging reveals Pareto-optimal trade-offs between accuracy and calibration.
Merged models outperform both parent models in accuracy.
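The findings above refer to calibration without naming a metric. One common way to quantify it is the expected calibration error (ECE), sketched below under that assumption; this page does not state which metric the paper actually uses. The function name and the equal-width binning scheme are illustrative choices.

```python
from typing import List, Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: weighted mean of |confidence - accuracy| per bin."""
    n = len(confidences)
    if n == 0:
        return 0.0
    bins: List[list] = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        acc = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / n) * abs(avg_conf - acc)
    return ece


# An overconfident toy model: always fully confident, right half the time.
overconfident = expected_calibration_error([1.0, 1.0, 1.0, 1.0], [True, False, True, False])
```

A model that reports 100% confidence but is right only half the time gets a large ECE, which is the kind of overconfidence the abstract attributes to alignment.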
Abstract
The "alignment tax" of post-training is typically framed as a drop in task accuracy. We show it also involves a severe loss of calibration, making models overconfident and less reliable, and their outputs less diverse. We show that this trade-off can be navigated effectively via a simple post-hoc intervention: interpolating between a model's weights before and after alignment. Crucially, this is not a strict trade-off. We find that the process consistently reveals Pareto-optimal interpolations: models that improve accuracy beyond both parents while substantially recovering the calibration lost during alignment. Our work demonstrates that simple model merging provides a computationally efficient method for mitigating the full scope of the alignment tax, yielding models that are more capable and more reliable.
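The intervention described in the abstract can be sketched as plain linear interpolation of parameters, merged = (1 - λ) · base + λ · aligned, applied per parameter. The snippet below is a minimal sketch under that assumption; the abstract does not specify the exact merging formula, and the dict-of-lists weight format, the function name, and the symbol `lam` are hypothetical stand-ins for real model checkpoints.

```python
from typing import Dict, List

# Hypothetical flat representation: parameter name -> list of values.
Weights = Dict[str, List[float]]


def interpolate_weights(base: Weights, aligned: Weights, lam: float) -> Weights:
    """Return (1 - lam) * base + lam * aligned, parameter by parameter."""
    if base.keys() != aligned.keys():
        raise ValueError("models must share the same parameter names")
    merged: Weights = {}
    for name, base_vals in base.items():
        aligned_vals = aligned[name]
        if len(base_vals) != len(aligned_vals):
            raise ValueError(f"size mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - lam) * b + lam * a for b, a in zip(base_vals, aligned_vals)
        ]
    return merged


# Toy two-parameter "models"; lam = 0 gives the base model, lam = 1 the aligned one.
base = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
aligned = {"layer.weight": [1.0, 4.0], "layer.bias": [3.0]}
sweep = {lam: interpolate_weights(base, aligned, lam) for lam in (0.0, 0.25, 0.5, 0.75, 1.0)}
```

Each merged model in such a sweep would then be evaluated for accuracy and calibration to find the coefficient with the best trade-off. Because merging is only arithmetic on existing weights, it needs no additional training.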
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Reinforcement Learning in Robotics · Human Pose and Action Recognition
