Accelerating RL for LLM Reasoning with Optimal Advantage Regression
Kianté Brantley, Mingyu Chen, Zhaolin Gao, Jason D. Lee, Wen Sun, Wenhao Zhan, Xuezhou Zhang

TL;DR
This paper introduces $A^*$-PO, a novel RL framework that accelerates training of large language models for reasoning by directly approximating the optimal advantage function, reducing computational costs and memory usage.
Contribution
The paper proposes $A^*$-PO, a two-stage policy optimization method that estimates the optimal value offline and performs efficient on-policy updates, improving training efficiency for LLM reasoning tasks.
Findings
Achieves competitive performance on mathematical reasoning benchmarks.
Cuts training time by up to 2×.
Lowers peak memory usage by over 30%.
Abstract
Reinforcement learning (RL) has emerged as a powerful tool for fine-tuning large language models (LLMs) to improve complex reasoning abilities. However, state-of-the-art policy optimization methods often suffer from high computational overhead and memory consumption, primarily due to the need for multiple generations per prompt and the reliance on critic networks or advantage estimates of the current policy. In this paper, we propose $A^*$-PO, a novel two-stage policy optimization framework that directly approximates the optimal advantage function and enables efficient training of LLMs for reasoning tasks. In the first stage, we leverage offline sampling from a reference policy to estimate the optimal value function $V^*$, eliminating the need for costly online value estimation. In the second stage, we perform on-policy updates using a simple least-squares regression loss with only a…
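The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the standard KL-regularized RL setting, in which the optimal value of a prompt $x$ is $V^*(x) = \beta \log \mathbb{E}_{y \sim \pi_{\mathrm{ref}}}[\exp(r(x, y)/\beta)]$ and the optimal policy satisfies $\beta \log(\pi^*(y|x)/\pi_{\mathrm{ref}}(y|x)) = r(x, y) - V^*(x)$. The temperature `beta`, the function names, and the scalar log-probabilities are hypothetical and chosen only for illustration.

```python
import math


def estimate_v_star(ref_rewards, beta):
    """Stage 1 (offline), assumed form: estimate V*(x) from rewards of
    responses sampled from the reference policy for one prompt.

    Computes beta * log(mean(exp(r / beta))) with the usual max-shift
    so the exponentials cannot overflow.
    """
    scaled = [r / beta for r in ref_rewards]
    shift = max(scaled)
    mean_exp = sum(math.exp(s - shift) for s in scaled) / len(scaled)
    return beta * (shift + math.log(mean_exp))


def advantage_regression_loss(logp_policy, logp_ref, reward, v_star, beta):
    """Stage 2 (on-policy), assumed form: squared error between the
    policy's implied advantage and the estimated optimal advantage
    for a single sampled response.
    """
    implied_advantage = beta * (logp_policy - logp_ref)
    optimal_advantage = reward - v_star
    return (implied_advantage - optimal_advantage) ** 2


# Hypothetical usage: four reference samples with binary correctness rewards.
v_star = estimate_v_star([1.0, 0.0, 0.0, 1.0], beta=1.0)
loss = advantage_regression_loss(
    logp_policy=-12.0, logp_ref=-12.5, reward=1.0, v_star=v_star, beta=1.0
)
```

Under these assumptions, the value estimate needs only samples from the fixed reference policy, so it can be computed once before training, and the regression target for each on-policy response is a scalar rather than the output of a learned critic.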
Taxonomy
Topics: Fuzzy Logic and Control Systems · Natural Language Processing Techniques
