A-3PO: Accelerating Asynchronous LLM Training with Staleness-aware Proximal Policy Approximation
Xiaocan Li, Shiliang Wu, Zheng Shen

TL;DR
A-3PO introduces a staleness-aware approximation of the proximal policy for asynchronous RL training. It removes the extra forward pass that the proximal policy normally requires, accelerating large language model training (a reported 1.8x speedup) while maintaining comparable performance.
Contribution
The paper proposes A-3PO, an approximation method that eliminates the extra forward pass needed to compute the proximal policy, speeding up asynchronous RL training of large language models.
Findings
Achieves 1.8x training speedup
Maintains comparable performance to standard methods
Reduces computational overhead in asynchronous RL training
Abstract
Decoupled PPO has been a successful reinforcement learning (RL) algorithm for handling the high data staleness that arises in asynchronous RL. Its decoupled loss improves on the learning stability of coupled-loss algorithms (e.g., standard PPO, GRPO) by introducing a proximal policy that decouples the off-policy correction (importance weight) from the policy update constraint (trust region). However, the proximal policy requires an extra forward pass through the model at each training step, which creates computational overhead when training large language models. We observe that since the proximal policy only serves as a trust region anchor between the behavior and target policies, we can approximate it through simple interpolation without explicit computation. We call this approach A-3PO (APproximated Proximal Policy Optimization). A-3PO eliminates this overhead,…
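To make the idea in the abstract concrete, the following is a minimal per-token sketch of a decoupled PPO loss in which the proximal log-probability is interpolated instead of being computed by a forward pass. It is an illustration under stated assumptions, not the authors' implementation: the excerpt does not give the interpolation rule, so the log-space linear interpolation in `approx_prox_logprob` and the weighting in `staleness_weight` are hypothetical choices, as are all function and parameter names.

```python
import math


def staleness_weight(staleness):
    """Hypothetical weight on the behavior policy (an assumption, not from the paper).

    staleness = 0 gives 1.0 (proximal policy equals behavior policy);
    larger staleness moves the anchor toward the target policy.
    """
    return 1.0 / (1.0 + staleness)


def approx_prox_logprob(logp_behav, logp_target, alpha):
    """Assumed interpolation in log-probability space.

    alpha = 1 recovers the behavior policy, alpha = 0 the target policy.
    In an autograd framework this value would be treated as a constant
    (no gradient flows through the anchor).
    """
    return alpha * logp_behav + (1.0 - alpha) * logp_target


def decoupled_ppo_token_loss(logp_target, logp_behav, logp_prox,
                             advantage, clip_eps=0.2):
    """Per-token decoupled PPO loss (negated clipped surrogate)."""
    # Off-policy correction: importance weight of proximal vs. behavior policy.
    importance_weight = math.exp(logp_prox - logp_behav)
    # Trust region: ratio of target vs. proximal policy, clipped around 1.
    ratio = math.exp(logp_target - logp_prox)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    return -importance_weight * surrogate


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    logp_behav, logp_target = -1.2, -0.9
    alpha = staleness_weight(2)
    logp_prox = approx_prox_logprob(logp_behav, logp_target, alpha)
    loss = decoupled_ppo_token_loss(logp_target, logp_behav, logp_prox,
                                    advantage=0.5)
    print(loss)
```

The sketch keeps the two roles of the proximal policy visible: it appears in the importance weight against the behavior policy and in the clipped ratio against the target policy. Only the source of `logp_prox` changes when the interpolation replaces a forward pass.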
Taxonomy
Topics: Reinforcement Learning in Robotics · Topic Modeling · Domain Adaptation and Few-Shot Learning
