GDEPO: Group Dual-dynamic and Equal-right Advantage Policy Optimization with Enhanced Training Data Utilization for Sample-Constrained Reinforcement Learning

Zhengqing Yan; Xinyang Liu; Yi Zhang; Fan Guo; ChengXun Jia; Junchen Wan; Yao Liu; Qi Liu; Jihao Huang; Kang Song

arXiv:2601.06795·cs.AI·January 23, 2026

GDEPO: Group Dual-dynamic and Equal-right Advantage Policy Optimization with Enhanced Training Data Utilization for Sample-Constrained Reinforcement Learning

Zhengqing Yan, Xinyang Liu, Yi Zhang, Fan Guo, ChengXun Jia, Junchen Wan, Yao Liu, Qi Liu, Jihao Huang, Kang Song

PDF

Open Access

TL;DR

GDEPO introduces a novel reinforcement learning approach for Automated Theorem Proving, improving data efficiency and training stability by dynamically resampling, decoupling advantage signs, and applying extra gradient steps.

Contribution

It proposes GDEPO, a new policy optimization method with dynamic sampling, advantage decoupling, and iterative updates, tailored for sample-constrained ATP tasks.

Findings

01

GDEPO outperforms existing methods on multiple ATP datasets.

02

Dynamic resampling improves proof success rates.

03

Advantage decoupling stabilizes policy updates.

Abstract

Automated Theorem Proving (ATP) represents a fundamental challenge in Artificial Intelligence (AI), requiring the construction of machine-verifiable proofs in formal languages such as Lean to evaluate AI reasoning capabilities. Reinforcement learning (RL), particularly the high-performance Group Relative Policy Optimization (GRPO) algorithm, has emerged as a mainstream approach for this task. However, in ATP scenarios, GRPO faces two critical issues: when composite rewards are used, its relative advantage estimation may conflict with the binary feedback from the formal verifier; meanwhile, its static sampling strategy may discard entire batches of data if no valid proof is found, resulting in zero contribution to model updates and significant data waste. To address these limitations, we propose Group Dual-dynamic and Equal-right-advantage Policy Optimization (GDEPO), a method…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMachine Learning and Data Classification · Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning