Self-Hinting Language Models Enhance Reinforcement Learning
Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, Jiang Bian

TL;DR
This paper introduces SAGE, a reinforcement learning framework that uses privileged hints during training to improve language model alignment under sparse rewards, leading to better performance without hints at test time.
Contribution
SAGE is an on-policy RL method that injects privileged hints during training to prevent advantage collapse and increase within-group outcome diversity, outperforming GRPO.
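To make "advantage collapse" concrete, here is a minimal sketch of group-relative advantages as they are commonly computed in GRPO-style training: each rollout's reward is standardized against its own group. The function name, the `eps` constant, and the toy rewards are assumptions for illustration and are not taken from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one rollout group (GRPO-style sketch).

    `eps` is a hypothetical stabilizer that avoids division by zero when
    every reward in the group is the same.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One success among four rollouts: advantages differ, so there is a signal.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# Identical sparse rewards: every advantage is zero and the update vanishes.
collapsed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(collapsed)
```

When all rollouts in a group fail (or all succeed), every entry of `collapsed` is zero, so the group contributes nothing to the policy gradient.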
Findings
SAGE outperforms GRPO on 6 benchmarks across 3 LLMs.
Diverse self-hints serve as an adaptive curriculum for better learning.
SAGE achieves average improvements of +2.0, +1.2, and +1.3 points, one figure per evaluated LLM.
Abstract
Group Relative Policy Optimization (GRPO) has recently emerged as a practical recipe for aligning large language models with verifiable objectives. However, under sparse terminal rewards, GRPO often stalls because rollouts within a group frequently receive identical rewards, causing relative advantages to collapse and updates to vanish. We propose self-hint aligned GRPO with privileged supervision (SAGE), an on-policy reinforcement learning framework that injects privileged hints during training to reshape the rollout distribution under the same terminal verifier reward. For each prompt, the model samples a compact hint (e.g., a plan or decomposition) and then generates a solution conditioned on both the prompt and the hint. Crucially, the task reward is unchanged; hints only increase within-group outcome diversity under finite sampling, preventing GRPO advantages from collapsing…
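The finite-sampling argument can be illustrated with a small calculation. The sketch below assumes a binary verifier reward and independent rollouts with a fixed per-rollout success probability; under those assumptions, a group carries a learning signal only when it contains both a success and a failure. The function name, the group size, and the two probabilities are hypothetical and are not results from the paper.

```python
def mixed_group_probability(p_success, group_size):
    """Probability that a group has at least one success and one failure.

    Assumes independent rollouts with a binary reward (a simplification
    made for this sketch).
    """
    all_fail = (1.0 - p_success) ** group_size
    all_pass = p_success ** group_size
    return 1.0 - all_fail - all_pass


# Hypothetical numbers: a hard prompt without a hint versus the same
# prompt when a hint raises the per-rollout success probability.
for label, p in [("no hint", 0.01), ("with hint", 0.2)]:
    prob = mixed_group_probability(p, group_size=8)
    print(f"{label}: P(mixed group) = {prob:.3f}")
```

If a hint moves the success probability away from 0 or 1, the chance of a non-degenerate group rises even though the reward function itself is untouched.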
Code & Models
- 🤗 baohao/SAGE_Llama-3.2-3B-Instruct (model, 168 downloads)
- 🤗 baohao/SAGE_Qwen2.5-7B-Instruct (model, 281 downloads)
- 🤗 baohao/SAGE_Qwen3-4B-Instruct-2507 (model)
- 🤗 baohao/SAGE-light_Llama-3.2-3B-Instruct (model, 2 downloads)
- 🤗 baohao/SAGE-light_Qwen2.5-7B-Instruct (model, 528 downloads, 2 likes)
- 🤗 baohao/SAGE-light_Qwen3-4B-Instruct-2507 (model, 1 download)
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Reinforcement Learning in Robotics
