Self-Hinting Language Models Enhance Reinforcement Learning

Baohao Liao; Hanze Dong; Xinxing Xu; Christof Monz; Jiang Bian

arXiv:2602.03143 · cs.LG · February 4, 2026

TL;DR

This paper introduces SAGE, a reinforcement learning framework that uses privileged hints during training to improve language model alignment under sparse rewards, leading to better performance without hints at test time.

Contribution

SAGE is a novel on-policy RL method that injects privileged hints during training to prevent advantage collapse and enhance diversity, outperforming previous methods like GRPO.

Findings

01. SAGE outperforms GRPO on 6 benchmarks across 3 LLMs.

02. Diverse self-hints serve as an adaptive curriculum for better learning.

03. SAGE achieves average improvements of +2.0, +1.2, and +1.3 points on different models.

Abstract

Group Relative Policy Optimization (GRPO) has recently emerged as a practical recipe for aligning large language models with verifiable objectives. However, under sparse terminal rewards, GRPO often stalls because rollouts within a group frequently receive identical rewards, causing relative advantages to collapse and updates to vanish. We propose self-hint aligned GRPO with privileged supervision (SAGE), an on-policy reinforcement learning framework that injects privileged hints during training to reshape the rollout distribution under the same terminal verifier reward. For each prompt $x$, the model samples a compact hint $h$ (e.g., a plan or decomposition) and then generates a solution $\tau$ conditioned on $(x, h)$. Crucially, the task reward $R(x, \tau)$ is unchanged; hints only increase within-group outcome diversity under finite sampling, preventing GRPO advantages from collapsing…
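The advantage collapse described in the abstract can be illustrated with a minimal sketch. It assumes the commonly used GRPO normalization (reward minus group mean, divided by group standard deviation); the function name `group_relative_advantages`, the `eps` stabilizer, and the example reward values are illustrative assumptions, not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one rollout group (assumed GRPO-style form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps is a hypothetical stabilizer that avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Sparse terminal reward: every rollout in the group fails the verifier.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# One rollout succeeds, so the group has mixed outcomes.
mixed = group_relative_advantages([0.0, 1.0, 0.0, 0.0])

print(all_fail)  # every advantage is zero: no learning signal
print(mixed)     # the successful rollout gets a positive advantage
```

When all rewards in a group are identical, every advantage is zero and the policy gradient for that prompt vanishes. A group with mixed outcomes, which the abstract says hints make more likely under finite sampling, yields non-zero advantages.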

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Reinforcement Learning in Robotics