Mitigating Distribution Sharpening in Math RLVR via Distribution-Aligned Hint Synthesis and Backward Hint Annealing

Pei-Xi Xie; Che-Yu Lin; Cheng-Lin Yang

arXiv:2604.07747·cs.AI·April 10, 2026

TL;DR

This paper improves math reinforcement learning with verifiable rewards (RLVR) by synthesizing verified teacher hints conditioned on student-style responses and by annealing hint exposure during training, which improves performance on challenging questions.

Contribution

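The paper proposes Distribution-Aligned Hint Synthesis (DAHS) and Backward Hint Annealing (BHA) to address two issues in hint-based math RLVR training: teacher-student distribution mismatch, and the need to reduce hint exposure so that training matches no-hint evaluation.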

Findings

01. Improves pass@1 and pass@2048 on AIME benchmarks with Qwen3-1.7B-Base.

02. Enhances large-k performance with Llama-3.2-1B-Instruct.

03. Hint scaffolding restores learnable updates early and is gradually removed.
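
The findings above are stated in terms of pass@k. As a point of reference, the sketch below implements the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for a question with n sampled responses of which c are verified correct. The page does not say which estimator the authors use, so treat this as an assumption; the function names `pass_at_k` and `mean_pass_at_k` are illustrative only.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Equals the probability that a uniformly random size-k subset of the
    n samples contains at least one correct sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over questions, given (n, c) pairs per question."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("results must not be empty")
    return sum(scores) / len(scores)
```

For example, a question solved by exactly 1 of 4096 samples has pass@1 = 1/4096 but pass@2048 = 0.5 under this estimator, which illustrates why gains at k = 1 and at large k can move independently.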

Abstract

Reinforcement learning with verifiable rewards (RLVR) can improve low-$k$ reasoning accuracy while narrowing solution coverage on challenging math questions, and pass@1 gains do not necessarily translate into better large-$k$ performance. Existing hint-based approaches can make challenging questions trainable, but they leave two issues underexplored: teacher-student distribution mismatch and the need to reduce hint exposure to match no-hint evaluation. We address these issues through two components. Distribution-Aligned Hint Synthesis (DAHS) constructs verified teacher hints conditioned on student-style responses. Backward Hint Annealing (BHA) anneals hint exposure across difficulty buckets and uses per-question hint dropout to preserve no-hint updates throughout RL training. We evaluate the method in math RLVR under the DAPO training framework across AIME24, AIME25, and AIME26 using…
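
The abstract describes BHA only at a high level: hint exposure is annealed across difficulty buckets, and per-question hint dropout keeps some no-hint updates throughout training. The sketch below is one possible reading of that description, not the paper's actual schedule. The linear decay, the rule that harder buckets keep hints longer, the dropout value, the prompt format, and the names `hint_probability` and `build_prompt` are all assumptions made for illustration.

```python
import random


def hint_probability(step: int, total_steps: int, bucket: int,
                     num_buckets: int, dropout: float = 0.2) -> float:
    """Hypothetical probability that a question in `bucket` is shown a hint.

    Assumptions (not from the paper): buckets are indexed 0 (easiest) to
    num_buckets - 1 (hardest); exposure decays linearly and reaches zero
    earlier for easier buckets; `dropout` is a constant per-question chance
    of withholding the hint even when the schedule would allow it.
    """
    if total_steps <= 0 or not 0 <= bucket < num_buckets:
        raise ValueError("invalid schedule arguments")
    progress = min(max(step / total_steps, 0.0), 1.0)
    # Fraction of training at which this bucket's hints are fully removed.
    cutoff = (bucket + 1) / num_buckets
    annealed = max(0.0, 1.0 - progress / cutoff)
    return annealed * (1.0 - dropout)


def build_prompt(question: str, hint: str, p_hint: float,
                 rng: random.Random) -> str:
    """Attach the hint with probability p_hint; otherwise train without it."""
    if rng.random() < p_hint:
        return f"{question}\n\nHint: {hint}"
    return question
```

Under these assumptions, every bucket's hint probability reaches zero by the final step, so the end of training matches the no-hint evaluation setting, while the dropout term guarantees that some no-hint prompts are seen from the first step onward.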
