Learning to Hint for Reinforcement Learning

Yu Xia; Canwen Xu; Zhewei Yao; Julian McAuley; Yuxiong He

arXiv:2604.00698·cs.LG·April 2, 2026

TL;DR

This paper introduces HiLL, a reinforcement learning framework that generates hints adapted to the current reasoner policy, so that hard questions yield a non-zero learning signal and the gains carry over to the no-hint setting used at test time.

Contribution

The paper proposes jointly training a hinter policy and a reasoner policy, with transfer-aware hint generation aimed at improving the reasoner's no-hint performance.

Findings

01 HiLL outperforms GRPO and prior hint-based methods across benchmarks.

02 Adaptive hints improve learning signals compared to fixed hints.

03 Transfer-weighted rewards promote better policy transfer from hinted to no-hint scenarios.

Abstract

Group Relative Policy Optimization (GRPO) is widely used for reinforcement learning with verifiable rewards, but it often suffers from advantage collapse: when all rollouts in a group receive the same reward, the group yields zero relative advantage and thus no learning signal. For example, if a question is too hard for the reasoner, all sampled rollouts can be incorrect and receive zero reward. Recent work addresses this issue by adding hints or auxiliary scaffolds to such hard questions so that the reasoner produces mixed outcomes and recovers a non-zero update. However, existing hints are usually fixed rather than adapted to the current reasoner, and a hint that creates learning signal under the hinted input does not necessarily improve the no-hint policy used at test time. To this end, we propose Hint Learning for Reinforcement Learning (HiLL), a framework that jointly trains a…
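The advantage collapse described in the abstract can be illustrated with a small sketch. It assumes the commonly used GRPO normalization, in which each rollout's advantage is its reward minus the group mean, divided by the group standard deviation; the function name `group_relative_advantages` and the `eps` stabilizer are illustrative choices, not taken from the paper or its repository.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one rollout group (assumed GRPO-style form).

    `eps` is a hypothetical stabilizer that avoids division by zero when
    every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# A question that is too hard: every rollout is wrong, so every reward is 0.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# The same question after a hint makes one rollout succeed.
mixed = group_relative_advantages([0.0, 1.0, 0.0, 0.0])

print(all_wrong)  # every advantage is zero, so the group gives no update
print(mixed)      # the correct rollout gets a positive advantage
```

With uniform rewards the numerator is zero for every rollout, so the group contributes no gradient; a hint that produces mixed outcomes restores a non-zero signal. Whether that signal also improves the no-hint policy is the separate transfer question the abstract raises.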

Code & Models

Repositories

Andree-9/HiLL (GitHub)
