SortedRL: Accelerating RL Training for LLMs through Online Length-Aware Scheduling

Yiqi Zhang; Huiqiang Jiang; Xufang Luo; Zhihe Yang; Chengruidong Zhang; Yifei Shen; Dongsheng Li; Yuqing Yang; Lili Qiu; Yang You

arXiv:2603.23414·cs.LG·March 25, 2026

Open Access · 3 Reviews

TL;DR

SortedRL introduces an online length-aware scheduling method that significantly improves RL training efficiency for large language models by reordering samples based on output length, reducing training time and enhancing performance.

Contribution

The paper presents SortedRL, a novel scheduling strategy that optimizes rollout efficiency and training stability for RL in LLMs through length-aware sample reordering and off-policy control mechanisms.

Findings

1. Reduces RL training bubble ratios by over 50%.
2. Achieves 3.9% to 18.4% performance improvements over baseline.
3. Effective on diverse tasks including logical puzzles and math challenges.
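
To make the bubble-ratio finding concrete, here is a minimal sketch under a stated assumption: the bubble ratio is taken to be the fraction of decode slot-steps left idle when every sample in a batch waits for the longest one. The paper's exact definition may differ, and the function name `bubble_ratio` and the token lengths are hypothetical. The sorted case is an offline illustration only, since real output lengths are not known before generation.

```python
def bubble_ratio(lengths, batch_size):
    """Fraction of decode slot-steps left idle (assumed definition, not the paper's)."""
    total_slots = 0
    used_slots = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # Every slot in the batch is held until the longest sample finishes.
        total_slots += max(batch) * len(batch)
        used_slots += sum(batch)
    return 1 - used_slots / total_slots


# Hypothetical output lengths in tokens, mixing short and long samples.
lengths = [100, 1600, 200, 1500, 150, 1700, 250, 1400]

arrival = bubble_ratio(lengths, batch_size=4)            # batches in arrival order
by_length = bubble_ratio(sorted(lengths), batch_size=4)  # length-homogeneous batches
print(f"arrival order: {arrival:.3f}, sorted by length: {by_length:.3f}")
```

With these made-up lengths, batches of similar length waste far fewer slot-steps than mixed batches, which is the intuition behind scheduling by length.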

Abstract

Scaling reinforcement learning (RL) has shown strong promise for enhancing the reasoning abilities of large language models (LLMs), particularly in tasks requiring long chain-of-thought generation. However, RL training efficiency is often bottlenecked by the rollout phase, which can account for up to 70% of total training time when generating long trajectories (e.g., 16k tokens), due to slow autoregressive generation and synchronization overhead between rollout and policy updates. We propose SortedRL, an online length-aware scheduling strategy designed to address this bottleneck by improving rollout efficiency and maintaining training stability. SortedRL reorders rollout samples based on output lengths, prioritizing short samples forming groups for early updates. This enables large rollout batches, flexible update batches, and near on-policy micro-curriculum construction simultaneously.…
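
The reordering idea in the abstract can be sketched as a small simulation. This is an illustrative sketch, not the authors' implementation: it assumes all samples in a rollout batch decode in parallel at one token per step, so a sample finishes at a step equal to its output length, and it hands out an update group as soon as `update_size` samples have finished. The function name, the parameter `update_size`, and the lengths are hypothetical. The off-policy control, oversubscription, and early-termination mechanisms are not modeled.

```python
import heapq


def length_sorted_update_groups(output_lengths, update_size):
    """Yield (finish_step, sample_ids) groups in completion order (shortest first)."""
    # Min-heap keyed by finish step; under the one-token-per-step assumption,
    # the finish step of a sample equals its output length.
    heap = [(length, idx) for idx, length in enumerate(output_lengths)]
    heapq.heapify(heap)

    group = []
    while heap:
        finish_step, idx = heapq.heappop(heap)
        group.append(idx)
        if len(group) == update_size:
            # Enough short samples have finished: release them for an early update.
            yield finish_step, group
            group = []
    if group:
        # Leftover samples form a final, smaller group.
        yield finish_step, group


# Hypothetical output lengths (tokens) for one large rollout batch.
lengths = [900, 120, 4000, 300, 150, 2500]
groups = list(length_sorted_update_groups(lengths, update_size=2))
for step, ids in groups:
    print(f"step {step}: update on samples {ids}")
```

In this toy run the first update becomes available at step 150 instead of step 4000, while the longest samples keep generating; each group also contains samples of similar length, which is one way to read the "micro-curriculum" described above.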

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 0 · Confidence 3

Strengths

- Significance: The paper addresses an important bottleneck in scaling RL training for reasoning models.
- Methodology: The proposed method is, at a high level, reasonable and easy to understand.

Weaknesses

- Clarity: The method is described almost entirely through high-level qualitative descriptions. There are no formal algorithm blocks or pseudocode to define critical components like the "oversubscription strategy," the "early termination" logic, or exactly how the "length-aware controller" manages the queue. This makes the mechanism ambiguous. Please provide formal algorithm blocks in the paper to help readers understand the method in detail. Example: [[Phuong and Hutter, 2024](https://arxiv.org/pdf/2207.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- SortedRL is a novel framework designed to alleviate the significant rollout bottleneck in RL and to address the instability introduced by the off-policy updates that come with large rollout batches.
- Sorting rollouts by output length for updates is intuitive and improves both hardware efficiency (lower bubble ratio) and sample efficiency (improved performance at earlier steps) through a higher degree of on-policyness.
- The paper includes significant quantitative results to show

Weaknesses

- A significant implicit assumption in the paper is that longer rollouts == harder prompts, which is what enables the micro-curriculum. This largely holds for math and reasoning tasks, where longer rollouts mean longer, richer reasoning chains. However, for other tasks such as summarization, general instruction following, and safety alignment, this is not necessarily true. The effectiveness of SortedRL on such tasks is unclear.
- The paper would benefit from a deeper analysis into why SortedRL

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The paper is well motivated, as the bubble issue is a well-known problem in RLVR training.

Weaknesses

1. The description of the central components of the proposed strategy is unclear, especially in Sec 3.1.
2. There is no direct comparison with existing speedup techniques in RLVR training, including the one-step-off RL training used in the DeepCoder project [1] and the fully asynchronous RL training in AReaL [2]. A comparison of the speedup against existing asynchronous training approaches is expected.
3. The proposed strategy does not seem robust to the design choices, as evidenced by the colla

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Machine Learning in Materials Science