PAINT: Partial-Solution Adaptive Interpolated Training for Self-Distilled Reasoners

Zhiquan Tan; Yinrong Hong

arXiv:2604.26573 · cs.LG · April 30, 2026

TL;DR

PAINT is a training method for LLM reasoning that adaptively masks and interpolates verified solutions, yielding consistent improvements on math benchmarks.

Contribution

The paper proposes PAINT, an approach that dynamically masks and interpolates solutions during self-distillation to improve reasoning performance.

Findings

01 PAINT improves macro Avg@12 by 2.1 points over the prior baseline on Qwen3-8B.

02 PAINT outperforms GRPO by 2.9 points in the same Qwen3-8B macro Avg@12 comparison.

03 Gains are consistent across multiple Qwen3 model scales.
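
The findings are reported as macro Avg@12. The page does not define the metric, so the sketch below assumes the common reading: Avg@k is the mean accuracy over k sampled answers per problem, averaged over the problems of a benchmark, and the macro score is the unweighted mean of the per-benchmark scores. The function names and the toy data (which uses k = 4, not 12) are hypothetical and are not taken from the paper.

```python
from statistics import mean


def avg_at_k(correct_flags):
    """Avg@k for one benchmark, in percentage points.

    correct_flags: one list per problem, each holding k booleans that say
    whether the corresponding sampled answer was verified as correct.
    """
    per_problem = [mean(1.0 if ok else 0.0 for ok in flags) for flags in correct_flags]
    return 100.0 * mean(per_problem)


def macro_avg(per_benchmark):
    """Unweighted mean of per-benchmark Avg@k scores (assumed meaning of 'macro')."""
    return mean(avg_at_k(flags) for flags in per_benchmark.values())


# Hypothetical toy data: two benchmarks, k = 4 samples per problem.
toy = {
    "benchmark_a": [[True, True, False, False], [True, True, True, True]],
    "benchmark_b": [[False, False, False, True]],
}

print(avg_at_k(toy["benchmark_a"]))  # 75.0
print(avg_at_k(toy["benchmark_b"]))  # 25.0
print(macro_avg(toy))                # 50.0
```

Under this reading, a 2.1-point gain means the unweighted mean of the per-benchmark scores rose by 2.1 percentage points. Benchmarks with few problems count as much as large ones.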

Abstract

Improving large language model (LLM) reasoning requires supervision that is both aligned with the model's own test-time states and informative at the token level. Reinforcement learning with verifiable rewards provides on-policy exploration but offers sparse, high-variance credit; supervised fine-tuning and distillation provide dense targets but often train on fixed trajectories or rely on stronger teachers. Recent privileged on-policy self-distillation explores a middle ground by scoring student rollouts with the same model under verified solution context. We revisit this setting through a contextual re-scoring lens: for reasoning, the important choices are not only whether privileged context is available, but how much of it should be revealed and where its distribution should shape the student. We propose PAINT (Partial-solution Adaptive INterpolated Training), which masks the…
