WS-GRPO: Weakly-Supervised Group-Relative Policy Optimization for Rollout-Efficient Reasoning

Gagan Mundada; Zihan Huang; Rohan Surana; Sheldon Yu; Jennifer Yuntong Zhang; Xintong Li; Tong Yu; Lina Yao; Jingbo Shang; Julian McAuley; Junda Wu

arXiv:2602.17025·cs.LG·February 20, 2026

TL;DR

WS-GRPO introduces a weakly supervised approach that enhances reasoning efficiency in language models by using outcome-based guidance to determine when to continue or stop reasoning, reducing unnecessary deliberation.

Contribution

The paper proposes a weakly supervised training method that improves rollout efficiency in reasoning models by using outcome-only correctness signals to guide partial trajectories.

Findings

01

WS-GRPO significantly reduces rollout length in reasoning tasks.

02

It maintains competitive accuracy compared to baseline methods.

03

Theoretical analysis supports the effectiveness of outcome-based guidance.

Abstract

Group Relative Policy Optimization (GRPO) is effective for training language models on complex reasoning. However, because the objective is defined relative to a group of sampled trajectories, extended deliberation can create more chances to realize relative gains, which leads to inefficient reasoning and overthinking and complicates the trade-off between correctness and rollout efficiency. Controlling this behavior is difficult in practice because (i) length penalties are hard to calibrate: longer rollouts may reflect harder problems that require longer reasoning, so penalizing tokens risks truncating useful reasoning along with redundant continuation; and (ii) supervision that directly indicates when to continue or stop is typically unavailable beyond final answer correctness. We propose Weakly Supervised GRPO (WS-GRPO), which improves rollout efficiency by converting terminal…
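As an illustration of what "defined relative to a group of sampled trajectories" means, the following is a minimal sketch of a commonly used GRPO-style advantage, in which each rollout's reward is normalized by the mean and standard deviation of its group. It also shows a naive per-token length penalty of the kind the abstract describes as hard to calibrate. This is not the WS-GRPO method. The function names, the penalty weight `lam`, the use of the population standard deviation, and the toy rewards and lengths are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group (assumed GRPO-style form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, variants exist
    return [(r - mu) / (sigma + eps) for r in rewards]


def length_penalized_rewards(correct, lengths, lam):
    """Hypothetical baseline: outcome reward minus lam per generated token."""
    return [float(c) - lam * n for c, n in zip(correct, lengths)]


# Toy group of four rollouts for one prompt (made-up values).
correct = [1, 1, 0, 0]            # final-answer correctness only
lengths = [200, 800, 300, 900]    # rollout lengths in tokens

# Outcome-only rewards: both correct rollouts get the same advantage,
# regardless of how long they deliberated.
plain = group_relative_advantages([float(c) for c in correct])

# With a length penalty, the shorter correct rollout is preferred, but the
# result depends heavily on the choice of lam.
penalized = group_relative_advantages(
    length_penalized_rewards(correct, lengths, lam=0.0005)
)

print(plain)
print(penalized)
```

With outcome-only rewards the short and long correct rollouts are indistinguishable, while the penalized variant separates them only through a hand-picked `lam`, which is the calibration difficulty point (i) refers to.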

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics