GAPO: Robust Advantage Estimation for Real-World Code LLMs

Jianqing Zhang; Zhezheng Hao; Wei Xia; Hande Dong; Hong Wang; Chenxing Wei; Yuyan Zhou; Yubin Qi; Qiang Lin; Jian Cao

arXiv:2510.21830 · cs.LG · January 9, 2026

TL;DR

GAPO replaces the group mean used in group-relative advantage estimation with an adaptive, per-prompt baseline Q, making RL post-training of code LLMs more robust to skewed, noisy rewards in real-world code editing.

Contribution

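The paper proposes Group Adaptive Policy Optimization (GAPO). For each prompt, GAPO finds the reward interval with the highest signal-to-noise ratio (SNR) and uses the median of that interval as an adaptive Q in place of the group mean when computing advantages. The method remains plug-and-play and efficient for RL fine-tuning of code LLMs.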

Findings

01. Up to 4.35 in-domain exact-match improvement

02. Up to 5.30 out-of-domain exact-match improvement

03. Lower clipping ratios and higher GPU throughput

Abstract

Reinforcement learning (RL) is widely used for post-training large language models (LLMs) in code editing, where group-relative methods such as GRPO are popular because their advantage estimation is critic-free and normalized. However, in real-world code-editing scenarios, reward distributions are often skewed with unpredictable noise, which distorts advantage computation and increases rollout outliers. To address this issue, we propose Group Adaptive Policy Optimization (GAPO), which adaptively finds the interval with the highest signal-to-noise ratio (SNR) for each prompt and uses the median of that interval as an adaptive Q, replacing the group mean in the advantage calculation to further reduce noise. This adaptive Q robustly handles rollout noise while remaining plug-and-play and efficient. We evaluate GAPO on nine instruction-tuned LLMs (3B-14B) using a collected large dataset of 51,844…
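The baseline swap described in the abstract can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation. The names `grpo_advantages`, `adaptive_q`, `adaptive_advantages`, and the `coverage` parameter are hypothetical. The abstract does not spell out how the highest-SNR interval is selected, so the sketch substitutes the narrowest window covering a fixed fraction of the sorted rewards. Scaling by the group standard deviation is carried over from the group-relative baseline as a further assumption.

```python
import math
from statistics import mean, median, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative baseline: center on the group mean, scale by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def adaptive_q(rewards, coverage=0.5):
    """Median of the narrowest window holding `coverage` of the sorted rewards.

    Assumption: the narrowest window is a stand-in for the highest-SNR
    interval; the selection rule used in the paper is not given in the abstract.
    """
    ordered = sorted(rewards)
    n = len(ordered)
    k = max(1, math.ceil(coverage * n))  # number of rewards per window
    # Slide a window of k rewards and keep the one with the smallest spread.
    best_start = min(range(n - k + 1), key=lambda i: ordered[i + k - 1] - ordered[i])
    return median(ordered[best_start:best_start + k])


def adaptive_advantages(rewards, coverage=0.5, eps=1e-6):
    """Same as the group-relative form, with the mean replaced by the adaptive Q."""
    q = adaptive_q(rewards, coverage)
    sigma = pstdev(rewards)  # assumption: std scaling is kept unchanged
    return [(r - q) / (sigma + eps) for r in rewards]


# One prompt, five rollouts, one outlier reward.
rewards = [1.0, 1.0, 1.0, 2.0, 10.0]
mean_based = grpo_advantages(rewards)
q_based = adaptive_advantages(rewards)
```

In this toy group the outlier pulls the mean to 3.0, so the rollout scoring 2.0 receives a negative advantage under the mean baseline even though it beats most of its peers. With the window-median Q of 1.0, the same rollout receives a positive advantage and the three typical rollouts receive zero.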

Taxonomy

Topics: Software Engineering Research · Software Testing and Debugging Techniques · Natural Language Processing Techniques