AAPO: Enhancing the Reasoning Capabilities of LLMs with Advantage Margin

Jian Xiong; Jingbo Zhou; Jingyong Ye; Qiang Huang; Dejing Dou

arXiv:2505.14264·cs.LG·April 15, 2026

TL;DR

This paper introduces AAPO (Advantage-Augmented Policy Optimization), a reinforcement learning algorithm that improves reasoning in large language models by augmenting group relative advantage estimation with a margin, leading to better performance on mathematical reasoning benchmarks.

Contribution

AAPO is a novel RL method that uses margin-based advantage augmentation to address the training inefficiencies of group relative advantage estimation in LLM reasoning, particularly when the estimated advantage approaches zero.

Findings

01

AAPO outperforms existing methods on multiple mathematical reasoning benchmarks.

02

The margin-based advantage estimation improves training efficiency and model performance.

03

Code for AAPO is publicly available on GitHub (JianxXiong/AAPO).

Abstract

Reinforcement learning (RL) has emerged as an effective approach for enhancing the reasoning capabilities of large language models (LLMs), especially in scenarios where supervised fine-tuning (SFT) falls short due to limited chain-of-thought (CoT) data. Among RL-based post-training methods, group relative advantage estimation, as exemplified by Group Relative Policy Optimization (GRPO), has attracted considerable attention for eliminating the dependency on the value model, thereby simplifying training compared to traditional approaches like Proximal Policy Optimization (PPO). However, we observe that existing group relative advantage estimation methods still suffer from training inefficiencies, particularly when the estimated advantage approaches zero. To address this limitation, we propose Advantage-Augmented Policy Optimization (AAPO), a novel RL algorithm that optimizes the…
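
The near-zero advantage problem mentioned in the abstract can be illustrated with a minimal sketch of GRPO-style group relative advantage estimation. This is not the AAPO method (the abstract is truncated before describing it); it only shows the commonly used baseline, in which each sampled response's reward is normalized by the mean and standard deviation of its group. The function name `group_relative_advantages` and the `eps` stabilizer are assumptions made for illustration.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: (reward - group mean) / (group std + eps).

    `rewards` holds one scalar reward per response sampled for the same
    prompt. `eps` is an assumed small constant that avoids division by
    zero when every reward in the group is identical.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]


# A group with mixed outcomes gives advantages of both signs,
# so the policy gradient carries a learning signal.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every response earns the same reward gives advantages
# of exactly zero, so this prompt contributes no gradient at all.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print("mixed:  ", mixed)
print("uniform:", uniform)
```

When a prompt is consistently solved (or consistently failed) by every sample in the group, the advantages collapse to zero as in the `uniform` case, which is one way the inefficiency described above can arise.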

Code & Models

Repositories

JianxXiong/AAPO (GitHub)
