DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training

Tianhao Hu; Xiangcheng Liu; Youshao Xiao; Yang Zheng; Xuan Huang; Jinrui Ding; Yufei Zhang; Tao Liang; Hongyu Zang; Quan Chen; Yueqing Sun; Wenjie Shi; Chao Zhang; Wei Wang; Qi Gu; Yerui Sun; Yucheng Xie; Xunliang Cai

arXiv:2604.26256·cs.LG·April 30, 2026

TL;DR

DORA is a scalable asynchronous RL system for language model training that significantly improves throughput and training speed while maintaining convergence, by introducing multi-version streaming rollout to address long-tailed trajectories.

Contribution

The paper presents DORA, a novel asynchronous RL training system with multi-version streaming rollout, addressing long-tailed trajectories and achieving higher efficiency without losing convergence.

Findings

01. DORA achieves 2-3x higher throughput than state-of-the-art systems.

02. In industrial settings, DORA accelerates RL training by 2-4x over synchronous methods.

03. Open-source models trained with DORA match advanced LLMs on reasoning benchmarks.

Abstract

Reinforcement learning (RL) has become a critical paradigm for LLM post-training, yet the rollout phase, which accounts for 50-80% of total step time, is bottlenecked by skewed generation: long-tailed trajectories, which are indispensable for model performance, block the entire training pipeline. Asynchronous training offers a natural remedy by overlapping generation with training, but it introduces a fundamental tension between efficiency and algorithmic correctness. We identify three constraints that asynchronous training must satisfy to preserve convergence: intra-trajectory policy consistency, data integrity, and bounded staleness. Existing approaches either fail to intrinsically address the long-tailed trajectory problem, which is further exacerbated by the imbalance characteristic of Mixture-of-Experts models, or deviate from the standard RL training formulation, thereby hindering model convergence. Therefore, we…
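The three constraints named in the abstract can be illustrated with a small admission check on rollout data. This is a minimal sketch under one plausible reading of the terms, not DORA's implementation: the abstract does not define the constraints, so the interpretation here (a trajectory is generated by a single policy version, is consumed whole, and lags the trainer by at most a fixed number of versions) is an assumption, as are the names `Trajectory`, `token_versions`, `complete`, and `max_staleness`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    """Hypothetical rollout record; field names are illustrative only."""
    prompt_id: int
    token_versions: List[int]  # policy version that produced each token
    complete: bool = True      # False if the sample was truncated or partially dropped


def is_policy_consistent(traj: Trajectory) -> bool:
    # Assumed reading of intra-trajectory policy consistency:
    # every token comes from the same policy version.
    return len(set(traj.token_versions)) == 1


def staleness(traj: Trajectory, trainer_version: int) -> int:
    # Version lag between the trainer and the oldest token in the trajectory.
    return trainer_version - min(traj.token_versions)


def admissible(traj: Trajectory, trainer_version: int, max_staleness: int) -> bool:
    # A trajectory is admitted only if all three assumed constraints hold.
    return (
        bool(traj.token_versions)           # non-empty
        and traj.complete                   # assumed reading of data integrity
        and is_policy_consistent(traj)      # single policy version
        and 0 <= staleness(traj, trainer_version) <= max_staleness  # bounded staleness
    )


if __name__ == "__main__":
    batch = [
        Trajectory(0, [5, 5, 5]),                  # fresh and consistent
        Trajectory(1, [4, 5, 5]),                  # mixes two policy versions
        Trajectory(2, [1, 1]),                     # too stale for the trainer
        Trajectory(3, [5, 5], complete=False),     # truncated sample
    ]
    kept = [t.prompt_id for t in batch if admissible(t, trainer_version=6, max_staleness=2)]
    print(kept)
```

In this reading, a trajectory is rejected as soon as any one constraint fails, which makes the tension in the abstract concrete: a tighter `max_staleness` keeps training closer to the on-policy formulation but discards or delays more long-running rollouts.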
