GIFT: Reconciling Post-Training Objectives via Finite-Temperature Gibbs Initialization

Zhengyang Zhao; Lu Ma; Yizhen Jiang; Xiaochen Ma; Zimo Meng; Chengyu Shen; Lexiang Tang; Haoze Sun; Peng Pei; Wentao Zhang

arXiv:2601.09233 · cs.LG · March 19, 2026

TL;DR

GIFT introduces a finite-temperature Gibbs initialization method that aligns supervised fine-tuning with reinforcement learning, enhancing exploration and performance in large reasoning models.

Contribution

The paper proposes GIFT, a novel initialization approach that bridges SFT and RL by incorporating supervision as a finite-temperature energy potential.

Findings

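01 When used as the initialization for RL, GIFT outperforms standard SFT and other competitive baselines.

02 GIFT preserves the exploration space that subsequent RL needs, avoiding the distributional collapse induced by standard SFT.

03 Treating supervision as a finite-temperature energy potential gives a principled way to keep the SFT and RL objectives consistent.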

Abstract

The prevailing post-training paradigm for Large Reasoning Models (LRMs), Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL), suffers from an intrinsic optimization mismatch: the rigid supervision inherent in SFT induces distributional collapse, thereby exhausting the exploration space necessary for subsequent RL. In this paper, we reformulate SFT to reconcile post-training objectives and propose Gibbs Initialization with Finite Temperature (GIFT). We characterize standard SFT as a degenerate zero-temperature limit that suppresses base priors. Conversely, GIFT incorporates supervision as a finite-temperature energy potential, establishing a distributional bridge that promotes objective consistency throughout the post-training pipeline. Our experiments demonstrate that GIFT significantly outperforms standard SFT and other competitive baselines when utilized for RL…
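The abstract does not give the exact form of the energy potential, so the following is only an illustrative sketch of the general idea, not the authors' formulation. It assumes a toy three-token vocabulary and a hypothetical target of the form q(v) ∝ p_base(v) · exp(1[v = label] / T), where `gibbs_target`, `entropy`, and the indicator-style energy are all names and choices introduced here for illustration. Under that assumption, a finite temperature keeps some of the base model's prior, while T → 0 collapses the target onto the supervised token, which is the one-hot target that standard SFT uses.

```python
import math


def gibbs_target(base_logprobs, label, temperature):
    """Hypothetical finite-temperature target over a toy vocabulary.

    Assumed form (not taken from the paper):
        q(v) ∝ p_base(v) * exp(1[v == label] / temperature)
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Log-space scores: base log-prob plus a supervision bonus on the label.
    scores = [
        lp + (1.0 / temperature if i == label else 0.0)
        for i, lp in enumerate(base_logprobs)
    ]
    # Numerically stable softmax over the scores.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Toy base-model distribution over three tokens; token 1 is the supervised label.
base = [math.log(p) for p in (0.5, 0.3, 0.2)]

warm = gibbs_target(base, label=1, temperature=1.0)   # finite temperature
cold = gibbs_target(base, label=1, temperature=0.01)  # near the zero-temperature limit

print("warm:", [round(p, 3) for p in warm], "entropy:", round(entropy(warm), 3))
print("cold:", [round(p, 3) for p in cold], "entropy:", round(entropy(cold), 3))
```

In this toy setting the warm target still ranks the unsupervised tokens in the base model's order and retains nonzero entropy, whereas the cold target is effectively one-hot. That contrast is the intuition behind describing standard SFT as a degenerate zero-temperature limit.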

Taxonomy

Topics: Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI) · Stochastic Gradient Optimization Techniques