TL;DR
SARL is a label-free reinforcement learning framework that improves reasoning models by rewarding the structure of their reasoning paths rather than their final answers, with reported gains on math and open-ended tasks.
Contribution
The paper introduces SARL, which rewards reasoning topology rather than outcomes and is reported to outperform prior label-free methods and even some supervised approaches.
Findings
SARL outperforms prior label-free RL baselines on math tasks.
SARL exceeds RL methods trained with ground-truth supervision.
SARL achieves significant improvements on open-ended reasoning tasks.
Abstract
Reinforcement learning is critical to improving large reasoning models, but its success relies heavily on verifiable rewards (RLVR), making it hard to use in open-ended domains where correctness is ambiguous and cannot be verified. Moreover, reasoning trajectories remain largely unconstrained, and optimizing solely toward the final answer can favor early exploitation over generalization. In this work, we ask whether general reasoning ability can be improved by teaching models how to think (the structure of reasoning) rather than what to produce (the outcome of reasoning), and we extend traditional RLVR to open-ended settings. We introduce Structure-Aware Reinforcement Learning (SARL), a label-free framework that constructs per-response reasoning maps from intermediate thinking steps and rewards their reasoning topology. SARL shifts supervision from destination to path, encouraging…
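To make the idea of rewarding a reasoning map's topology concrete, here is a minimal sketch in Python. It is not the paper's method: the abstract is truncated and does not say how maps are built or scored. The functions `build_reasoning_map` and `structure_reward`, the `(later, earlier)` link format, and the scoring rule (the fraction of steps that the final step transitively builds on) are all hypothetical choices made for illustration. The only property carried over from the abstract is that the score is computed from the response alone, with no reference answer.

```python
from collections import deque


def build_reasoning_map(links, num_steps):
    """Hypothetical reasoning map: graph[i] lists the earlier steps that step i builds on.

    `links` is a list of (later, earlier) step-index pairs. How such links would be
    extracted from a model's thinking text is NOT specified here; it is an assumption
    that they are given.
    """
    graph = {i: [] for i in range(num_steps)}
    for later, earlier in links:
        if not (0 <= earlier < later < num_steps):
            raise ValueError(f"invalid link {(later, earlier)}")
        graph[later].append(earlier)
    return graph


def structure_reward(graph):
    """Toy topology score in [0, 1] (an illustrative stand-in, not SARL's reward).

    Returns the fraction of steps reachable from the final step by following
    "builds on" links backwards, so dangling steps lower the score.
    """
    if not graph:
        return 0.0
    final = max(graph)
    seen = {final}
    queue = deque([final])
    while queue:
        node = queue.popleft()
        for parent in graph[node]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return len(seen) / len(graph)


# A chain where every step feeds the next: 0 <- 1 <- 2 <- 3.
chain = build_reasoning_map([(1, 0), (2, 1), (3, 2)], num_steps=4)
# A scattered response: the final step uses only step 2; steps 0 and 1 are unused.
scattered = build_reasoning_map([(3, 2)], num_steps=4)

print(structure_reward(chain))      # 1.0
print(structure_reward(scattered))  # 0.5
```

Because the score depends only on how the steps connect, it can be computed for open-ended prompts where no verifiable answer exists, which is the setting the abstract targets.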
