Train Less, Learn More: Adaptive Efficient Rollout Optimization for Group-Based Reinforcement Learning

Zhi Zhang, Zhen Han, Costas Mavromatis, Qi Zhu, Yunyi Zhang, Sheng Guan, Dingmin Wang, Xiong Zhou, Shuai Wang, Soji Adeshina, Vassilis Ioannidis, and Huzefa Rangwala

arXiv:2602.14338 · cs.LG · February 17, 2026

TL;DR

AERO enhances group-based reinforcement learning for large language models by adaptively pruning rollouts and maintaining Bayesian posteriors, significantly reducing compute costs while preserving or improving performance.

Contribution

AERO introduces an adaptive rollout strategy with selective rejection and Bayesian updates, improving efficiency in RL fine-tuning of large language models.
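The summary does not say what AERO's Bayesian posterior is placed over or how it is updated. Purely as an illustration, the sketch below assumes a Beta(α, β) posterior on a single prompt's pass rate p, updated conjugately from binary verifiable rewards. Under that assumption, the predictive probability that a fresh group of N rollouts is all-correct or all-incorrect (and therefore carries zero group-normalized advantage) has a closed form: E[p^N] + E[(1-p)^N] = [B(α+N, β) + B(α, β+N)] / B(α, β). The class name `PromptPosterior`, its methods, and the uniform Beta(1, 1) prior are hypothetical choices for this sketch, not the paper's.

```python
import math
from dataclasses import dataclass
from typing import Sequence


def log_beta(a: float, b: float) -> float:
    """Log of the Beta function B(a, b), computed via log-gamma for numerical stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


@dataclass
class PromptPosterior:
    """Hypothetical Beta(alpha, beta) posterior over one prompt's pass rate p."""

    alpha: float = 1.0  # assumed uniform prior Beta(1, 1)
    beta: float = 1.0

    def update(self, rewards: Sequence[int]) -> None:
        """Conjugate Beta-Bernoulli update from binary rewards (1 = correct, 0 = incorrect)."""
        successes = sum(rewards)
        self.alpha += successes
        self.beta += len(rewards) - successes

    def mean(self) -> float:
        """Posterior mean of the pass rate p."""
        return self.alpha / (self.alpha + self.beta)

    def prob_degenerate(self, n: int) -> float:
        """Predictive probability that n rollouts (i.i.d. given p) all share one outcome."""
        lb = log_beta(self.alpha, self.beta)
        all_correct = math.exp(log_beta(self.alpha + n, self.beta) - lb)  # E[p^n]
        all_wrong = math.exp(log_beta(self.alpha, self.beta + n) - lb)  # E[(1-p)^n]
        return all_correct + all_wrong


if __name__ == "__main__":
    posterior = PromptPosterior()
    print("prior P(degenerate group of 4):", round(posterior.prob_degenerate(4), 4))
    posterior.update([1, 1, 1, 1])  # observe an all-correct group
    print("posterior mean pass rate:", round(posterior.mean(), 4))
    print("posterior P(degenerate group of 4):", round(posterior.prob_degenerate(4), 4))
```

With the uniform prior, a group of 4 is degenerate with probability 0.4; after four correct rollouts the posterior becomes Beta(5, 1) and that probability rises above 0.56, suggesting another group of 4 on this prompt is likely to yield no advantage signal. How AERO acts on an estimate like this (for example, by changing the group size or rejecting the prompt) is not described in this summary.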

Findings

01. AERO reduces total training compute by about 48%.
02. AERO shortens wall-clock time per step by about 45%.
03. AERO matches or improves on GRPO's performance metrics.

Abstract

Reinforcement learning (RL) plays a central role in large language model (LLM) post-training. Among existing approaches, Group Relative Policy Optimization (GRPO) is widely used, especially for RL with verifiable rewards (RLVR) fine-tuning. In GRPO, each query prompts the LLM to generate a group of rollouts with a fixed group size $N$. When all rollouts in a group share the same outcome, either all correct or all incorrect, the group-normalized advantages become zero, yielding no gradient signal and wasting fine-tuning compute. We introduce Adaptive Efficient Rollout Optimization (AERO), an enhancement of GRPO. AERO uses an adaptive rollout strategy, applies selective rejection to strategically prune rollouts, and maintains a Bayesian posterior to prevent zero-advantage dead zones. Across three model configurations (Qwen2.5-Math-1.5B, Qwen2.5-7B, and Qwen2.5-7B-Instruct), AERO improves…
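The zero-advantage failure mode can be reproduced in a few lines. This minimal sketch assumes binary verifiable rewards (1 = correct, 0 = incorrect) and the common GRPO normalization (r - mean) / (std + eps) using the population standard deviation; the function names and the value of `eps` are illustrative, not taken from the paper, and GRPO implementations differ in details such as which standard deviation they use.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-normalized advantages: (r - mean) / (std + eps).

    Uses the population standard deviation; eps guards against division by zero
    when every reward in the group is identical.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def has_signal(rewards):
    """True if the rewards differ, i.e. at least one advantage in the group is nonzero."""
    return len(set(rewards)) > 1


if __name__ == "__main__":
    groups = {
        "mixed": [1, 0, 1, 1],
        "all_correct": [1, 1, 1, 1],
        "all_incorrect": [0, 0, 0, 0],
    }
    for name, rewards in groups.items():
        advs = group_advantages(rewards)
        status = "signal" if has_signal(rewards) else "no signal"
        print(name, [round(a, 3) for a in advs], status)
```

A mixed group such as [1, 0, 1, 1] yields nonzero advantages, whereas [1, 1, 1, 1] or [0, 0, 0, 0] collapses to all zeros, so the rollouts spent on it contribute no advantage-weighted gradient. Filtering groups with `has_signal` illustrates the general idea of discarding uninformative groups; it is not a reproduction of AERO's selective rejection rule.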

Taxonomy

Topics: Topic Modeling · Reinforcement Learning in Robotics · Artificial Intelligence in Healthcare and Education