Learning While Staying Curious: Entropy-Preserving Supervised Fine-Tuning via Adaptive Self-Distillation for Large Reasoning Models

Hao Wang; Hao Gu; Hongming Piao; Kaixiong Gong; Yuxiao Ye; Xiangyu Yue; Sirui Han; Yike Guo; Dapeng Wu

arXiv:2602.02244 · cs.LG · February 10, 2026

Open Access · 1 model

TL;DR

This paper introduces CurioSFT, an entropy-preserving supervised fine-tuning method that enhances exploration in large reasoning models, leading to improved performance in both fine-tuning and reinforcement learning stages.

Contribution

It proposes a novel entropy-preserving SFT approach with self-exploratory distillation and adaptive temperature selection to improve exploration and factual stability.

Findings

01 CurioSFT outperforms vanilla SFT by 2.5 points on in-distribution tasks and 2.9 points on out-of-distribution tasks.

02 Enhanced exploration during SFT leads to a 5.0-point average improvement in the RL stage.

03 The method balances exploration and factual stability in large reasoning models.

Abstract

The standard post-training recipe for large reasoning models, supervised fine-tuning followed by reinforcement learning (SFT-then-RL), may limit the benefits of the RL stage: while SFT imitates expert demonstrations, it often causes overconfidence and reduces generation diversity, leaving RL with a narrowed solution space to explore. Adding entropy regularization during SFT is not a cure-all; it tends to flatten token distributions toward uniformity, increasing entropy without improving meaningful exploration capability. In this paper, we propose CurioSFT, an entropy-preserving SFT method designed to enhance exploration capabilities through intrinsic curiosity. It consists of (a) Self-Exploratory Distillation, which distills the model toward a self-generated, temperature-scaled teacher to encourage exploration within its capability; and (b) Entropy-Guided Temperature Selection, which…
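
The contrast the abstract draws between a temperature-scaled self-teacher and a flattened, near-uniform distribution can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: the function names, the direction of the KL term (teacher to student), and the fixed teacher temperature are all assumptions made for illustration, and the Entropy-Guided Temperature Selection component is omitted because the abstract is truncated before describing it.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis at a given temperature."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def entropy(probs):
    """Shannon entropy (in nats) of a distribution over the last axis."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)


def self_distill_kl(student_logits, teacher_temperature):
    """KL(teacher || student) where the teacher is the model's own
    temperature-scaled distribution.

    Assumption: the teacher is treated as a fixed target (no gradient),
    and the KL direction is a guess; the source does not specify either.
    """
    teacher = np.clip(softmax(student_logits, teacher_temperature), 1e-12, 1.0)
    student = np.clip(softmax(student_logits, 1.0), 1e-12, 1.0)
    return (teacher * (np.log(teacher) - np.log(student))).sum(axis=-1)


# Hypothetical next-token logits over a 4-token vocabulary.
logits = np.array([4.0, 2.0, 0.0, -2.0])

student = softmax(logits, 1.0)
teacher = softmax(logits, 1.5)   # softened copy of the model's own distribution
uniform = np.full(4, 0.25)       # the limit that naive entropy maximization pushes toward

print("student entropy:", entropy(student))
print("teacher entropy:", entropy(teacher))
print("uniform entropy:", entropy(uniform))
print("self-distillation KL:", self_distill_kl(logits, 1.5))
```

In this toy setting the softened teacher has higher entropy than the student while keeping the same token ranking, whereas the uniform distribution has the highest entropy but discards the ranking entirely. That is one way to read the abstract's claim that raising entropy alone does not imply more meaningful exploration.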

Code & Models

Models

🤗 Hao0oWang/Qwen2.5-Math-7B-16k-think (model, 4 downloads)

Taxonomy

Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)