A-3PO: Accelerating Asynchronous LLM Training with Staleness-aware Proximal Policy Approximation

Xiaocan Li; Shiliang Wu; Zheng Shen

arXiv:2512.06547 · cs.LG · March 9, 2026


TL;DR

A-3PO introduces a staleness-aware approximation of the proximal policy used in asynchronous RL training. It substantially reduces computational overhead and accelerates large language model training while maintaining performance.

Contribution

The paper proposes A-3PO, an approximation method that removes the extra forward pass otherwise needed to compute the proximal policy. This speeds up asynchronous RL training of large language models.
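The following sketch shows where the proximal policy enters training and why an interpolated proximal policy costs no extra forward pass. It assumes the decoupled PPO objective in its commonly written form. The log-linear interpolation, the weight α (which the paper presumably ties to staleness by a rule not given in this summary), and the stop-gradient notation sg[·] are illustrative assumptions, not the paper's stated formulas.

```latex
\begin{aligned}
\mathcal{J}(\theta) &= \mathbb{E}_{a\sim\pi_{\text{behav}}}\!\left[\frac{\pi_{\text{prox}}(a\mid s)}{\pi_{\text{behav}}(a\mid s)}
  \min\!\Big(u_\theta\,\hat{A},\ \operatorname{clip}(u_\theta,\,1-\epsilon,\,1+\epsilon)\,\hat{A}\Big)\right],
  \qquad u_\theta = \frac{\pi_\theta(a\mid s)}{\pi_{\text{prox}}(a\mid s)},\\
\log \pi_{\text{prox}} &\approx \alpha \log \pi_{\text{behav}} + (1-\alpha)\,\operatorname{sg}\!\left[\log \pi_\theta\right],
  \qquad \alpha \in [0,1],\\
\Rightarrow\quad \frac{\pi_{\text{prox}}}{\pi_{\text{behav}}} &= e^{(1-\alpha)\Delta},
  \qquad u_\theta = e^{\alpha\Delta}\ \text{(in value)},
  \qquad \Delta = \operatorname{sg}\!\left[\log \pi_\theta\right] - \log \pi_{\text{behav}}.
\end{aligned}
```

Under this assumption, both ratios depend only on the behavior log-probabilities stored at rollout time and on the current log-probabilities that the training step computes anyway. No separate pass for π_prox is required.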

Findings

01. Achieves a 1.8x training speedup.

02. Maintains performance comparable to standard methods.

03. Reduces computational overhead in asynchronous RL training.

Abstract

Decoupled PPO has been a successful reinforcement learning (RL) algorithm for handling high data staleness in the asynchronous RL setting. The decoupled loss used in decoupled PPO improves on the learning stability of coupled-loss algorithms (e.g., standard PPO, GRPO). It does so by introducing a proximal policy that decouples the off-policy correction (importance weight) from the policy update constraint (trust region). However, the proximal policy requires an extra forward pass through the model at each training step, which creates computational overhead when training large language models. We observe that the proximal policy only serves as a trust-region anchor between the behavior and target policies. We can therefore approximate it through simple interpolation without computing it explicitly. We call this approach A-3PO (APproximated Proximal Policy Optimization). A-3PO eliminates this overhead,…
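The interpolation idea above can be sketched in a few lines of Python. This is a minimal, value-only sketch under stated assumptions, not the authors' implementation. It assumes the commonly used decoupled PPO objective, a per-token log-linear interpolation for the proximal log-probability, and a free weight `alpha`. The functions `approx_prox_logprob` and `decoupled_ppo_loss` are hypothetical names. How A-3PO actually derives the weight from staleness is not stated in this summary.

```python
import math
from typing import Sequence


def approx_prox_logprob(behav_logp: float, cur_logp: float, alpha: float) -> float:
    """Hypothetical log-linear interpolation between behavior and current log-probs.

    In a real trainer cur_logp would be detached (stop-gradient). It comes from the
    forward pass already run for the policy gradient, so no extra pass is needed.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * behav_logp + (1.0 - alpha) * cur_logp


def decoupled_ppo_loss(
    cur_logp: Sequence[float],
    behav_logp: Sequence[float],
    advantages: Sequence[float],
    alpha: float,
    eps: float = 0.2,
) -> float:
    """Negative mean decoupled-PPO objective over tokens (values only, no autograd)."""
    if not (len(cur_logp) == len(behav_logp) == len(advantages)) or not advantages:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for lp, lb, adv in zip(cur_logp, behav_logp, advantages):
        lprox = approx_prox_logprob(lb, lp, alpha)
        iw = math.exp(lprox - lb)      # off-policy correction: pi_prox / pi_behav
        ratio = math.exp(lp - lprox)   # trust-region ratio: pi_theta / pi_prox
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        # Pessimistic PPO-style minimum, weighted by the importance weight.
        total += iw * min(ratio * adv, clipped * adv)
    return -total / len(advantages)
```

With `alpha=1.0`, the approximated proximal policy collapses to the behavior policy. The loss then reduces to standard PPO clipping against the behavior policy. With `alpha=0.0`, the clipped ratio is always 1 and only the importance weight remains. Because `iw * ratio` always equals `exp(lp - lb)`, `alpha` only decides how much of the total off-policy ratio is subject to clipping.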


Taxonomy

Topics: Reinforcement Learning in Robotics · Topic Modeling · Domain Adaptation and Few-Shot Learning