PAINT: Partial-Solution Adaptive Interpolated Training for Self-Distilled Reasoners
Zhiquan Tan, Yinrong Hong

TL;DR
PAINT is a training method for LLM reasoning that adaptively masks and interpolates verified solutions during self-distillation, yielding consistent improvements on math benchmarks.
Contribution
The paper proposes PAINT (Partial-solution Adaptive INterpolated Training), which dynamically masks and interpolates verified solutions during self-distillation to improve reasoning performance.
Findings
PAINT improves macro Avg@12 by 2.1 points over the prior baseline on Qwen3-8B.
PAINT outperforms GRPO by 2.9 points on the same benchmark.
Gains are consistent across multiple Qwen3 scales.
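For readers unfamiliar with the metric, here is a minimal sketch of how a macro Avg@k score is commonly computed. It assumes the usual reading: Avg@k is the mean correctness over k sampled answers per problem, averaged over problems, and "macro" means an unweighted mean over benchmarks. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from statistics import mean


def avg_at_k(correct_flags):
    """Avg@k for one benchmark.

    correct_flags: list of per-problem lists, each holding k booleans
    (one per sampled answer). Assumed definition, not from the paper.
    """
    per_problem = [mean(1.0 if c else 0.0 for c in flags) for flags in correct_flags]
    return mean(per_problem)


def macro_avg_at_k(per_benchmark):
    """Unweighted mean of Avg@k over benchmarks (the assumed 'macro' average)."""
    return mean(avg_at_k(flags) for flags in per_benchmark.values())


# Toy data with k = 2 samples per problem (hypothetical benchmarks).
toy = {
    "bench_a": [[True, False], [True, True]],
    "bench_b": [[False, False]],
}
score = macro_avg_at_k(toy)
```

Under this reading, a benchmark with few problems weighs as much as a large one, so a 2.1-point macro gain does not imply a uniform gain on every benchmark.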
Abstract
Improving large language model (LLM) reasoning requires supervision that is both aligned with the model's own test-time states and informative at the token level. Reinforcement learning with verifiable rewards provides on-policy exploration but offers sparse, high-variance credit; supervised fine-tuning and distillation provide dense targets but often train on fixed trajectories or rely on stronger teachers. Recent privileged on-policy self-distillation explores a middle ground by scoring student rollouts with the same model under verified solution context. We revisit this setting through a contextual re-scoring lens: for reasoning, the important choices are not only whether privileged context is available, but how much of it should be revealed and where its distribution should shape the student. We propose PAINT (Partial-solution Adaptive INterpolated Training), which masks the…
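The abstract is cut off before the method is specified, so the following is only an illustrative sketch of the two ideas it names: revealing part of a verified solution, and interpolating a privileged-context distribution with the student's own. The random masking scheme, the linear interpolation, the KL objective, and every name here (`mask_solution`, `interpolate`, `kl_divergence`, `reveal_fraction`, `alpha`) are assumptions made for illustration, not the authors' algorithm.

```python
import math
import random


def mask_solution(tokens, reveal_fraction, mask_token="<mask>", seed=0):
    """Reveal a random subset of solution tokens and mask the rest (hypothetical scheme)."""
    rng = random.Random(seed)
    n_reveal = round(reveal_fraction * len(tokens))
    keep = set(rng.sample(range(len(tokens)), n_reveal))
    return [t if i in keep else mask_token for i, t in enumerate(tokens)]


def interpolate(p_student, p_privileged, alpha):
    """Linear mix of two next-token distributions; alpha = 0 returns the student's."""
    return [(1 - alpha) * s + alpha * t for s, t in zip(p_student, p_privileged)]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy next-token distributions over a 3-token vocabulary (made-up numbers).
p_student = [0.6, 0.3, 0.1]       # model conditioned on the prompt only
p_privileged = [0.2, 0.7, 0.1]    # same model, also conditioned on a partially masked solution

partial = mask_solution(["x", "=", "4", "."], reveal_fraction=0.5)
target = interpolate(p_student, p_privileged, alpha=0.5)
loss = kl_divergence(target, p_student)  # token-level distillation signal
```

In this toy setup, `reveal_fraction` controls how much of the solution is shown and `alpha` controls how strongly the privileged distribution shapes the target, which mirrors the two questions the abstract raises.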
