AGPO: Asymmetric Group Policy Optimization for Verifiable Reasoning and Search Ads Relevance at JD
Yang Xu, Kun Yao, Yiming Deng, Zheng Fang, Kai Ming Ting, Ming Pang

TL;DR
This paper introduces AGPO, a reinforcement learning method that improves the accuracy and diversity of large language model reasoning, and the quality of search ads relevance modeling, by suppressing incorrect reasoning paths while preserving the base model's capacity to explore.
Contribution
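AGPO combines a negative-dominant reinforcement strategy, which suppresses incorrect reasoning paths to maintain exploration, with a group advantage mechanism that scales positive updates by intra-group variance to focus on rare correct reasoning paths. The paper reports that it outperforms existing methods.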
Findings
AGPO achieves state-of-the-art accuracy on five mathematical benchmarks.
AGPO improves pass@$k$ performance at scale.
In industrial search ads, AGPO enhances data annotation quality and downstream model performance.
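For readers unfamiliar with the pass@$k$ metric mentioned above, the following is a minimal sketch of the widely used unbiased estimator: given $n$ sampled solutions of which $c$ are correct, pass@$k$ is estimated as $1 - \binom{n-c}{k} / \binom{n}{k}$. This summary does not state the paper's exact evaluation protocol, so the use of this estimator here is an assumption, and the function name `pass_at_k` is illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Assumption: this is the standard unbiased estimator,
    1 - C(n - c, k) / C(n, k); it may differ from the paper's setup.
    """
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples are incorrect, every size-k subset
    # contains at least one correct sample.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts for one problem: 256 samples, 3 correct.
    for k in (1, 16, 128):
        print(k, round(pass_at_k(256, 3, k), 4))
```

Because the estimate rises with $k$ whenever at least one sample is correct, a model that rarely but sometimes finds a correct path can still score well at large $k$, which is why coverage at large sample sizes is used as a proxy for the reasoning boundary.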
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated notable success in enhancing the reasoning performance of large language models (LLMs). However, recent studies reveal that while current RLVR methods improve sampling efficiency towards correct paths, they do not elicit fundamentally new reasoning patterns. Instead, the reasoning capability boundary of trained models often narrows compared to their base models, with base models achieving higher coverage at large sample sizes. In this work, we propose Asymmetric Group Policy Optimization (AGPO) to counteract this boundary shrinkage. AGPO adopts a negative-dominant reinforcement strategy to suppress incorrect reasoning paths, maintaining the base model's exploration capacity. For positive reinforcement, AGPO adopts a group advantage mechanism, which scales positive updates based on intra-group variance, allowing the…
