TL;DR
AstraFlow introduces a dataflow-oriented reinforcement learning system for large language models, enabling efficient multi-policy training and elastic resource utilization with minimal system engineering effort.
Contribution
AstraFlow replaces trainer-centered control with principled component abstractions, so complex workloads and diverse compute resources can be supported without modifying the system.
Findings
Supports multi-policy training and elastic scaling effectively.
Achieves 2.7x faster training than existing RL systems.
Maintains comparable or better accuracy across various workloads.
Abstract
Reinforcement learning (RL) is increasingly used to improve the reasoning, coding, and tool-use capabilities of large language models, but agentic RL remains prohibitively expensive. Scaling RL to agentic LLMs requires supporting complex workloads, including multi-policy collaborative training, while efficiently using elastic, heterogeneous, and cross-region compute resources. Existing LLM RL systems support some of these capabilities, but each new extension often requires dedicated system engineering. This burden arises from trainer-centered control architectures and the lack of principled abstractions for RL system components. To address these limitations, we propose AstraFlow, a dataflow-oriented RL system that replaces conventional trainer-centered control with principled component abstractions. In AstraFlow, rollout services, dataflow management, and training are decoupled into…
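The decoupling the abstract describes (rollout services, dataflow management, and training as separate components) can be pictured with a toy sketch. This is not AstraFlow's API: every class, method, and policy name below (`RolloutService`, `DataflowManager`, `Trainer`, `"planner"`, `"coder"`) is hypothetical, and the rewards are placeholders. It only shows the general idea that producers and consumers talk to a shared dataflow layer rather than to each other, which is what lets several policies be trained side by side.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass


@dataclass
class Sample:
    policy_id: str       # which policy generated the rollout
    policy_version: int  # version of the weights used for generation
    reward: float


class RolloutService:
    """Hypothetical rollout component: produces samples, unaware of trainers."""

    def __init__(self, policy_id: str):
        self.policy_id = policy_id
        self.version = 0

    def generate(self, n: int) -> list[Sample]:
        # Placeholder rewards 0, 1, ..., n-1; a real service would run LLM inference.
        return [Sample(self.policy_id, self.version, float(i)) for i in range(n)]


class DataflowManager:
    """Hypothetical dataflow layer: routes samples into per-policy queues."""

    def __init__(self):
        self.queues: dict[str, deque] = {}

    def put(self, samples: list[Sample]) -> None:
        for s in samples:
            self.queues.setdefault(s.policy_id, deque()).append(s)

    def take(self, policy_id: str, batch_size: int) -> list[Sample]:
        q = self.queues.get(policy_id, deque())
        return [q.popleft() for _ in range(min(batch_size, len(q)))]


class Trainer:
    """Hypothetical trainer: consumes batches, unaware of rollout services."""

    def __init__(self, policy_id: str):
        self.policy_id = policy_id
        self.steps = 0

    def step(self, batch: list[Sample]) -> float:
        if not batch:
            return 0.0
        self.steps += 1
        # Stand-in for a gradient update: report the mean reward of the batch.
        return sum(s.reward for s in batch) / len(batch)


def run_demo() -> dict[str, float]:
    flow = DataflowManager()
    rollouts = [RolloutService("planner"), RolloutService("coder")]
    trainers = {r.policy_id: Trainer(r.policy_id) for r in rollouts}
    for r in rollouts:
        flow.put(r.generate(4))  # rollout side only writes to the dataflow layer
    # training side only reads from the dataflow layer
    return {pid: t.step(flow.take(pid, 4)) for pid, t in trainers.items()}
```

In this sketch, adding a third policy means adding one more `RolloutService` and `Trainer` pair. Neither existing component changes, because the only shared contract is the `Sample` record passed through the dataflow layer.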
