Unlocking the Pre-Trained Model as a Dual-Alignment Calibrator for Post-Trained LLMs

Beier Luo; Cheng Wang; Hongxin Wei; Sharon Li; Xuefeng Du

arXiv:2601.04277·cs.LG·January 9, 2026

Open Access 3 Reviews

TL;DR

This paper introduces Dual-Align, an unsupervised post-hoc method that improves confidence calibration of post-trained LLMs by addressing both confidence and process drift through dual alignment strategies, enhancing reliability without sacrificing performance.

Contribution

It proposes a novel dual alignment framework that corrects calibration errors by simultaneously addressing confidence and process drift in post-trained LLMs.

Findings

01. Reduces calibration errors significantly.

02. Approaches supervised oracle performance.

03. Maintains post-training performance gains.

Abstract

Post-training improves large language models (LLMs) but often worsens confidence calibration, leading to systematic overconfidence. Recent unsupervised post-hoc methods for post-trained LMs (PoLMs) mitigate this by aligning PoLM confidence to that of well-calibrated pre-trained counterparts. However, framing calibration as static output-distribution matching overlooks the inference-time dynamics introduced by post-training. In particular, we show that calibration errors arise from two regimes: (i) confidence drift, where final confidence inflates despite largely consistent intermediate decision processes, and (ii) process drift, where intermediate inference pathways diverge. Guided by this diagnosis, we propose Dual-Align, an unsupervised post-hoc framework for dual alignment in confidence calibration. Dual-Align performs confidence alignment to correct confidence drift via…
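To make "calibration error" and "systematic overconfidence" concrete, here is a minimal sketch of expected calibration error (ECE), one common binned estimator of miscalibration. The abstract as shown here does not name the metric the paper reports, so the choice of ECE, the function name, and the equal-width binning are assumptions for illustration only.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    confidences: predicted top-1 probabilities in [0, 1]
    correct:     1 if the prediction was right, else 0
    (Illustrative helper; not taken from the paper.)
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have equal length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Assign each prediction to an equal-width confidence bin.
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Overconfidence shows up as mean_conf > accuracy within a bin.
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# An overconfident toy model: 90% stated confidence, 50% actual accuracy.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 0, 0])
```

A model whose stated confidence matches its accuracy in every bin scores 0; the toy model above is penalized for the gap between its confidence and its accuracy.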

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

- The paper offers an interesting empirical observation by decomposing post-training miscalibration into two distinct regimes (“output drift” and “process drift”). This provides a more granular view of how post-training affects model confidence than prior work, and it may inspire further interpretability-based calibration research.
- The proposed method improves calibration with very low computational cost. It only learns a single scalar temperature parameter while exploiting existing model rep

Weaknesses

- The two-drift observation is purely empirical and lacks a principled foundation. The distinction between output and process drift is derived from layer-wise diagnostics rather than fundamental analysis, leaving open questions about whether these patterns generalize to other architectures, training recipes, or datasets. Strengthening this with a more formal analysis or cross-model verification would make the argument more convincing.
- The paper does not evaluate potential side effects on tas

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. Clear motivation: The authors identify and thoroughly analyze the process drift overlooked by prior work.
2. Methodological novelty: They propose a new unsupervised algorithm that jointly addresses output drift and process drift, achieving performance that approaches supervised methods.
3. Comprehensive experiments: Evaluations span multiple model scales and families, as well as diverse post-training paradigms.

Weaknesses

When fitting the temperature, the process-drift loss incorporates intermediate-layer logits; however, at inference time only the final layer (i.e., the output) is calibrated. This train–inference mismatch makes the source of the reported gains puzzling. In particular, because inference cannot account for process drift, the learned parameter does not actually stabilize the intermediate layers where such drift occurs, leaving the concrete mechanism behind the performance improvement unclear.
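The inference-time step this reviewer describes, a single scalar temperature applied only to the final-layer logits, can be sketched as follows. This is not Dual-Align's objective: the process-drift term over intermediate-layer logits is omitted, and the KL-to-reference loss, the grid search, and all names (`softmax`, `fit_temperature`, `plm_probs`) are assumptions chosen for illustration. The toy reference distribution is synthetic.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a scalar temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def fit_temperature(polm_logits, plm_probs, grid=None):
    """Grid-search one scalar T minimizing mean KL(reference || softmax(logits / T)).

    polm_logits: final-layer logits of the post-trained model, one list per example
    plm_probs:   reference distributions (here standing in for the pre-trained model)
    No labels are used, so the fit is unsupervised.
    """
    if grid is None:
        grid = [0.5 + 0.05 * i for i in range(91)]  # candidate T in [0.5, 5.0]

    def mean_kl(t):
        pairs = zip(polm_logits, plm_probs)
        return sum(kl_divergence(p, softmax(z, t)) for z, p in pairs) / len(polm_logits)

    return min(grid, key=mean_kl)


# Synthetic example: the reference is the same logits softened with T = 2.
polm_logits = [[8.0, 2.0, 0.0], [1.0, 7.0, 0.5]]
plm_probs = [softmax(z, 2.0) for z in polm_logits]

t_hat = fit_temperature(polm_logits, plm_probs)

# Inference: only the final-layer output is rescaled.
calibrated = softmax(polm_logits[0], t_hat)
```

Because dividing logits by a positive scalar preserves their ordering, the predicted class is unchanged and only the confidence moves. The same property is why the learned temperature cannot alter intermediate-layer behavior at inference time, which is the mismatch the reviewer raises.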

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The identification of process drift is refreshing.
2. The design of identifying PDL for process alignment is novel but intuitive.
3. Experimental results seem promising from the perspective of calibration.

Weaknesses

- My primary concern is that this work seeks to align the confidence of PLM and PoLM regardless of whether the confidence discrepancy is expected. Post-training itself is designed to alter model behaviors, and the confidence discrepancy between PLM and PoLM is a natural result, which may also indicate that expected behaviors are successfully injected. This work seems to consider this perspective neither theoretically (e.g. discerning what kind of discrepancy should be mitigated) nor empirically

Taxonomy

Topics: Data Stream Mining Techniques · Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI)