AGPO: Asymmetric Group Policy Optimization for Verifiable Reasoning and Search Ads Relevance at JD

Yang Xu; Kun Yao; Yiming Deng; Zheng Fang; Kai Ming Ting; Ming Pang

arXiv:2605.05826·cs.AI·May 8, 2026

TL;DR

This paper introduces AGPO, a reinforcement learning method that improves reasoning accuracy and diversity in large language models, both on mathematical reasoning and on search ads relevance tasks, by suppressing incorrect reasoning paths while preserving exploration.

Contribution

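AGPO combines a negative-dominant reinforcement strategy, which suppresses incorrect reasoning paths while maintaining the base model's exploration capacity, with a group advantage mechanism that scales positive updates by intra-group variance so that training focuses on rare correct reasoning paths. The paper reports that it outperforms existing methods.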

Findings

01

AGPO achieves state-of-the-art accuracy on five mathematical benchmarks.

02

AGPO improves pass@$k$ performance at scale.

03

In industrial search ads, AGPO enhances data annotation quality and downstream model performance.
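For readers unfamiliar with the pass@$k$ metric in the findings above, here is a minimal sketch of the unbiased estimator commonly used for it: given $n$ samples per problem of which $c$ are correct, it estimates the probability that at least one of $k$ drawn samples is correct. The function name `pass_at_k` is chosen here for illustration, and this page does not say whether the paper computes pass@$k$ this way.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Uses the common unbiased form 1 - C(n - c, k) / C(n, k).
    Assumption: this is the conventional estimator, not necessarily
    the exact evaluation protocol of the paper.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k incorrect samples: every draw of k contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

As $k$ grows, the estimate rewards a model for solving a problem at least once among many samples, which is why it is used as a proxy for the coverage of a model's reasoning.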

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated notable success in enhancing the reasoning performance of large language models (LLMs). However, recent studies reveal that while current RLVR methods improve sampling efficiency towards correct paths, they do not elicit fundamentally new reasoning patterns. Instead, the reasoning capability boundary of trained models often narrows compared to their base models, with base models achieving higher coverage at large sample sizes. In this work, we propose Asymmetric Group Policy Optimization (AGPO) to counteract this boundary shrinkage. AGPO adopts a negative-dominant reinforcement strategy to suppress incorrect reasoning paths, maintaining the base model's exploration capacity. For positive reinforcement, AGPO adopts a group advantage mechanism, which scales positive updates based on intra-group variance, allowing the…
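The abstract is truncated on this page and gives no formula, so the following is only a toy sketch of what an asymmetric, variance-scaled advantage could look like for one group of sampled responses to the same prompt. Everything in it is an assumption made for illustration: the function name `asymmetric_group_advantages`, the parameters `neg_weight` and `pos_scale`, the fixed penalty for incorrect samples, and the choice to multiply the positive weight directly by the reward variance. It is not the paper's update rule.

```python
def asymmetric_group_advantages(rewards, neg_weight=1.0, pos_scale=1.0):
    """Toy per-sample advantages for one group of verifiable rewards.

    rewards: list of 0/1 rewards for responses sampled from one prompt.
    Hypothetical rule (NOT taken from the paper):
      - incorrect samples (reward 0) get a fixed negative advantage,
        so suppression of wrong paths dominates the update;
      - correct samples get a positive advantage scaled by the
        population variance of rewards within the group.
    """
    n = len(rewards)
    if n == 0:
        return []
    mean = sum(rewards) / n
    variance = sum((r - mean) ** 2 for r in rewards) / n
    advantages = []
    for r in rewards:
        if r > 0:
            advantages.append(pos_scale * variance)
        else:
            advantages.append(-neg_weight)
    return advantages
```

Under these assumed choices, a group in which every sample is correct has zero variance, so its samples receive no positive push, while incorrect samples are always penalized. Whether AGPO normalizes, clips, or otherwise shapes these terms differently cannot be determined from the text available here.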
