TL;DR
GRLO demonstrates that a small amount of RLHF in open-ended environments can significantly enhance language models' generalization, reducing training costs while maintaining competitive performance.
Contribution
This work shows that limited RLHF training in open-ended environments can transfer conversational skills to downstream tasks, reducing data and compute needs.
Findings
GRLO improves average performance from 24.1 to 63.1 with minimal prompts and compute.
It requires 46 times less data and 68 times less compute than in-domain RLVR.
A subsequent RLVR stage adds only limited gains, mainly on complex benchmarks.
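To put these figures in perspective, the arithmetic behind them can be sketched as follows. This is only an illustration. It assumes the 24.1 and 63.1 averages are on the same score scale, and that "N times less" means the in-domain RLVR budget divided by N. The helper name `summarize_gain` is hypothetical and does not come from the paper.

```python
def summarize_gain(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain, ratio of after to before) for two average scores."""
    return after - before, after / before


# Average scores reported in the findings above.
abs_gain, ratio = summarize_gain(24.1, 63.1)

# Assumption: "46 times less data" and "68 times less compute" are read as
# 1/46 and 1/68 of the in-domain RLVR budget, expressed here as percentages.
data_pct = 100 / 46
compute_pct = 100 / 68

print(f"absolute gain: {abs_gain:.1f}")          # 39.0
print(f"ratio to baseline: {ratio:.2f}x")        # 2.62x
print(f"data used vs. RLVR: {data_pct:.1f}%")    # 2.2%
print(f"compute used vs. RLVR: {compute_pct:.1f}%")  # 1.5%
```

Under that reading, the reported improvement is a gain of 39.0 on the average score, obtained with roughly 2.2% of the data and 1.5% of the compute of the in-domain RLVR baseline.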
Abstract
Post-training has become a crucial step for unlocking the capabilities of large language models, with reinforcement learning (RL) emerging as a critical paradigm. Recent RL-based post-training has increasingly split into two paradigms: reinforcement learning from human feedback (RLHF), which optimizes models using human preference signals in target domains, and reinforcement learning from verifiable rewards (RLVR), which operates in verifier-backed environments. The latter has dominated recent reasoning-oriented post-training because it delivers stronger gains and higher efficiency on domain-specific tasks (e.g., reasoning). However, although in-domain RL training achieves promising performance, it still requires a substantial amount of GPU compute, which remains a major barrier to broad adoption. In this work, we study the generalization ability of RLHF learned from scratch from a…
