TL;DR
SPOT (Surgical Post-Training) is a proximal on-policy distillation method that improves reasoning in LLMs while retaining prior knowledge. It combines minimal-edit data rectification with a reward-based binary cross-entropy objective.
Contribution
The paper introduces a framework that combines an Oracle-driven data rectification pipeline with a reward-based binary cross-entropy objective to improve reasoning and knowledge retention during post-training.
Findings
SPOT improves Qwen3-8B accuracy by 6.2% with only 4k rectified math pairs.
Requires only 16 minutes of training on 8x H800 GPUs.
Provides a better initialization for reinforcement learning.
Abstract
Injecting new reasoning knowledge into Large Language Models (LLMs) via post-training often induces catastrophic forgetting. Recent studies emphasize the importance of on-policy data but suggest that KL-divergence fails to mitigate forgetting. In contrast, we show, both analytically and empirically, that the KL-constrained reward formulation actually plays a critical role in retaining knowledge during post-training. This motivates our Surgical Post-Training (SPOT), a proximal on-policy distillation framework designed to optimize reasoning efficiently while preserving prior knowledge. SPOT consists of (1) a data rectification pipeline employing an Oracle to surgically correct erroneous steps via minimal edits, generating proximal on-policy data; and (2) a reward-based binary cross-entropy objective essential for enhancing reasoning and mitigating forgetting. Empirically, with only 4k…
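The abstract names a reward-based binary cross-entropy objective built on a KL-constrained reward, but the excerpt here does not give its exact form. The sketch below is one plausible reading, not the authors' implementation. It assumes the reward is the scaled log-ratio between the policy and a frozen reference model, and that rectified responses are labeled 1 and erroneous ones 0. The function names, the `beta` value, and the labeling scheme are all illustrative assumptions.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """Assumed KL-constrained reward: beta * log(pi_theta(y|x) / pi_ref(y|x)).

    Inputs are sequence-level log-probabilities of the same response y
    under the trained policy and under the frozen reference model.
    """
    return beta * (logp_policy - logp_ref)


def bce_reward_loss(logp_policy: float, logp_ref: float, label: float,
                    beta: float = 0.1) -> float:
    """Binary cross-entropy on sigmoid(reward).

    label = 1.0 for a correct (e.g. rectified) response,
    label = 0.0 for an erroneous one (hypothetical labeling).
    """
    r = implicit_reward(logp_policy, logp_ref, beta)
    log_sigmoid_r = -softplus(-r)      # log(sigmoid(r))
    log_one_minus = -softplus(r)       # log(1 - sigmoid(r))
    return -(label * log_sigmoid_r + (1.0 - label) * log_one_minus)


if __name__ == "__main__":
    # Policy equal to the reference: reward is 0, loss is log(2) for either label.
    print(bce_reward_loss(-5.0, -5.0, label=1.0))
    # Raising the likelihood of a correct response lowers its loss.
    print(bce_reward_loss(-3.0, -5.0, label=1.0))
```

Under this reading the loss depends on the response only through its log-ratio to the reference model, so the reference acts as an anchor: the objective is indifferent until the policy moves away from it, which is one way a KL-style constraint could limit forgetting.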
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Artificial Intelligence in Healthcare and Education
