Surgical Post-Training: Proximal On-Policy Distillation for Reasoning with Knowledge Retention

Wenye Lin; Kai Han

arXiv:2603.01683 · cs.CL · May 19, 2026

TL;DR

SPOT is a proximal on-policy distillation method that improves reasoning in LLMs while retaining prior knowledge, using minimal-edit data rectification and a reward-based training objective.

Contribution

SPOT introduces a framework that combines a data rectification pipeline with a reward-based objective to improve reasoning while retaining prior knowledge during post-training.

Findings

01

SPOT improves Qwen3-8B accuracy by 6.2% with only 4k rectified math pairs.

02

Training takes only 16 minutes on 8x H800 GPUs.

03

The SPOT-trained model provides a better initialization for subsequent reinforcement learning.

Abstract

Injecting new reasoning knowledge into Large Language Models (LLMs) via post-training often induces catastrophic forgetting. Recent studies emphasize the importance of on-policy data but suggest that KL-divergence fails to mitigate forgetting. In contrast, we show, both analytically and empirically, that the KL-constrained reward formulation actually plays a critical role in retaining knowledge during post-training. This motivates our Surgical Post-Training (SPOT), a proximal on-policy distillation framework designed to optimize reasoning efficiently while preserving prior knowledge. SPOT consists of (1) a data rectification pipeline employing an Oracle to surgically correct erroneous steps via minimal edits, generating proximal on-policy data; and (2) a reward-based binary cross-entropy objective essential for enhancing reasoning and mitigating forgetting. Empirically, with only 4k…
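The abstract does not spell out the objective, so the following is only a minimal sketch of how a reward-based binary cross-entropy loss could be built on a KL-constrained reward. It assumes the implicit reward takes the familiar form beta * (log pi_theta(y|x) - log pi_ref(y|x)), and that rectified responses are labelled 1 while the original erroneous responses are labelled 0. The function names, the labelling scheme, and the value of `beta` are illustrative assumptions, not the authors' implementation.

```python
import math


def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """Assumed KL-constrained implicit reward: beta * log(pi_theta / pi_ref).

    logp_policy and logp_ref are sequence log-probabilities of the same
    response under the trained policy and the frozen reference model.
    """
    return beta * (logp_policy - logp_ref)


def reward_bce_loss(logp_policy: float, logp_ref: float, label: float,
                    beta: float = 0.1) -> float:
    """Binary cross-entropy with the implicit reward used as the logit.

    label = 1.0 for a response treated as correct (hypothetically, the
    rectified one), label = 0.0 for one treated as incorrect.
    """
    r = implicit_reward(logp_policy, logp_ref, beta)
    # Numerically stable softplus(r) = log(1 + exp(r)).
    softplus = max(r, 0.0) + math.log1p(math.exp(-abs(r)))
    # BCE with logits: -[y*log(sigmoid(r)) + (1-y)*log(1-sigmoid(r))]
    #                = softplus(r) - y*r
    return softplus - label * r


if __name__ == "__main__":
    # Hypothetical log-probabilities for one rectified / erroneous pair.
    print(reward_bce_loss(-42.0, -45.0, label=1.0))  # rectified response
    print(reward_bce_loss(-40.0, -38.0, label=0.0))  # erroneous response
```

Under these assumptions, a response whose policy and reference log-probabilities are equal has zero reward and a loss of log 2. Because the logit is the log-ratio to the reference model, the loss is driven by how far the policy has moved from the reference rather than by raw likelihood, which is one way a KL-style anchor can enter the objective.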

Code & Models

Repositories

Visual-AI/SPoT
github

Models

linius/Qwen3-8B-SPoT
huggingface · model · 6 downloads · 2 likes

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Artificial Intelligence in Healthcare and Education