TL;DR
This paper introduces AAPO, a new reinforcement learning algorithm that improves reasoning in large language models by enhancing advantage estimation, leading to better performance on mathematical reasoning tasks.
Contribution
AAPO is a novel RL method that uses margin-based advantage augmentation to address training inefficiencies in group advantage estimation for LLM reasoning.
Findings
AAPO outperforms existing methods on multiple mathematical reasoning benchmarks.
The margin-based advantage estimation improves training efficiency and model performance.
Code for AAPO is publicly available at the provided GitHub link.
Abstract
Reinforcement learning (RL) has emerged as an effective approach for enhancing the reasoning capabilities of large language models (LLMs), especially in scenarios where supervised fine-tuning (SFT) falls short due to limited chain-of-thought (CoT) data. Among RL-based post-training methods, group relative advantage estimation, as exemplified by Group Relative Policy Optimization (GRPO), has attracted considerable attention for eliminating the dependency on the value model, thereby simplifying training compared to traditional approaches like Proximal Policy Optimization (PPO). However, we observe that existing group relative advantage estimation methods still suffer from training inefficiencies, particularly when the estimated advantage approaches zero. To address this limitation, we propose Advantage-Augmented Policy Optimization (AAPO), a novel RL algorithm that optimizes the…
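The zero-advantage issue mentioned in the abstract can be illustrated with a minimal sketch of GRPO-style group relative advantage estimation, where each response's reward is standardized against the other responses sampled for the same prompt. This is not the AAPO method itself, whose margin-based formulation is not reproduced in this summary. The function name, the `eps` stabilizer, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampled group (GRPO-style sketch).

    `eps` and the population standard deviation are illustrative
    assumptions; real implementations may differ on both.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed outcomes in the group: advantages are clearly positive or negative.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Every response earns the same reward: every advantage is 0.0, so the
# policy-gradient term for this group carries no learning signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(uniform)
```

When all responses in a group are equally right or equally wrong, the group contributes nothing to the update, which is one concrete way the estimated advantage can approach zero.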
