Mitigating Distribution Sharpening in Math RLVR via Distribution-Aligned Hint Synthesis and Backward Hint Annealing
Pei-Xi Xie, Che-Yu Lin, Cheng-Lin Yang

TL;DR
This paper introduces a method to improve math reinforcement learning with verifiable rewards by aligning hints with student responses and gradually reducing hint exposure, leading to better performance on challenging questions.
Contribution
The paper proposes Distribution-Aligned Hint Synthesis and Backward Hint Annealing to address distribution mismatch and hint exposure issues in math RLVR training.
Findings
Improves pass@1 and pass@2048 on AIME benchmarks with Qwen3-1.7B-Base.
Enhances large-k performance with Llama-3.2-1B-Instruct.
Hint scaffolding restores learnable updates early in training and is gradually removed as training proceeds.
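The findings above report pass@1 and pass@2048. As a point of reference, pass@k is commonly estimated with the unbiased combinatorial estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per question and c of them are correct. The sketch below assumes that estimator; the summary does not state how the paper computes pass@k, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Assumption: this is the standard combinatorial estimator, not
    necessarily the exact evaluation code used in the paper.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = 4 samples and c = 1 correct, `pass_at_k(4, 1, 1)` is 0.25 and `pass_at_k(4, 1, 2)` is 0.5, which shows how a question with low pass@1 can still contribute to large-k coverage.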
Abstract
Reinforcement learning with verifiable rewards (RLVR) can improve low-k reasoning accuracy while narrowing solution coverage on challenging math questions, and pass@1 gains do not necessarily translate into better large-k performance. Existing hint-based approaches can make challenging questions trainable, but they leave two issues underexplored: teacher-student distribution mismatch and the need to reduce hint exposure to match no-hint evaluation. We address these issues through two components. Distribution-Aligned Hint Synthesis (DAHS) constructs verified teacher hints conditioned on student-style responses. Backward Hint Annealing (BHA) anneals hint exposure across difficulty buckets and uses per-question hint dropout to preserve no-hint updates throughout RL training. We evaluate the method in math RLVR under the DAPO training framework across AIME24, AIME25, and AIME26 using…
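The abstract describes BHA only at a high level: hint exposure is annealed across difficulty buckets, and per-question hint dropout keeps some no-hint updates in every phase of training. The following is a minimal sketch of how such a schedule could look, not the paper's implementation. The linear decay, the rule that easier buckets lose their hints sooner, trimming the hint from its end, and all names (`hint_keep_fraction`, `build_prompt`, `dropout_p`) are assumptions made for illustration.

```python
import random


def hint_keep_fraction(step: int, total_steps: int,
                       bucket: int, num_buckets: int) -> float:
    """Hypothetical schedule: fraction of the hint still shown at `step`.

    Assumption: bucket 0 is the easiest, and bucket b decays linearly to
    zero by the time (b + 1) / num_buckets of training has elapsed.
    """
    progress = step / total_steps
    end = (bucket + 1) / num_buckets
    return max(0.0, 1.0 - progress / end)


def build_prompt(question: str, hint_tokens: list[str],
                 keep_fraction: float, dropout_p: float,
                 rng: random.Random) -> str:
    """Attach a truncated hint, or no hint at all under dropout."""
    # Per-question hint dropout: this rollout trains without any hint.
    if rng.random() < dropout_p:
        return question
    # Illustrative choice: keep a prefix of the hint and drop the rest.
    n_keep = int(len(hint_tokens) * keep_fraction)
    if n_keep == 0:
        return question
    return question + "\nHint: " + " ".join(hint_tokens[:n_keep])
```

Under these assumptions, a question in the hardest of four buckets still sees half of its hint midway through training (`hint_keep_fraction(50, 100, 3, 4)` is 0.5), while the easiest bucket is already hint-free after the first quarter. Setting `dropout_p` above zero means some prompts match the no-hint evaluation condition from the first step onward.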
