GEPO: Group Expectation Policy Optimization for Stable Heterogeneous Reinforcement Learning

Han Zhang; Ruibin Zheng; Zexuan Yi; Zhuo Zhang; Hanyang Peng; Hui Wang; Zike Yuan; Cai Ke; Shiwei Chen; Jiacheng Yang; Yangning Li; Xiang Li; Jiangyue Yan; Yaoqi Liu; Liwen Jing; Jiayin Qi; Ruifeng Xu; Binxing Fang; Yue Yu

arXiv:2508.17850·cs.LG·January 30, 2026

Open Access·3 Reviews

TL;DR

This paper introduces GEPO, a novel asynchronous reinforcement learning algorithm designed for stable, decentralized training across geographically distributed nodes with heterogeneous resources, effectively reducing variance and improving stability.

Contribution

The paper proposes GEPO, a new group expectation policy optimization method that decouples parameter learning from rollout sampling, enabling robust decentralized RL training with theoretical variance reduction guarantees.

Findings

01. GEPO maintains high performance, with only a 3% drop under 1800 s latency.

02. GEPO reduces the best-to-last gap by 85% compared to GSPO.

03. GEPO achieves the highest scores in decentralized, resource-heterogeneous environments.

Abstract

As single-center computing approaches power constraints, decentralized training becomes essential. However, traditional Reinforcement Learning (RL) methods, crucial for enhancing large model post-training, cannot adapt to decentralized distributed training due to the tight coupling between parameter learning and rollout sampling. For this, we propose HeteroRL, a heterogeneous RL architecture that decouples these processes, enabling stable training across geographically distributed nodes connected via the Internet. The core component is Group Expectation Policy Optimization (GEPO), an asynchronous RL algorithm robust to latency caused by network delays or heterogeneity in computational resources. Our study reveals that high latency significantly increases KL divergence, leading to higher variance of importance weights and training instability. GEPO mitigates this issue by using group…
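The link the abstract draws between policy drift, KL divergence, and importance-weight variance can be checked on a toy case. The sketch below is illustrative only and is not taken from the paper: the four-way categorical "policy", the logits, and the drift direction are made-up assumptions. It uses the exact identity Var_q[p/q] = Σ p²/q − 1 to show that both the KL divergence and the variance grow as the learner policy p moves away from the stale sampler policy q.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two categorical distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def iw_variance(p, q):
    """Variance of w = p(y)/q(y) for y ~ q. Since E_q[w] = 1, Var = sum(p^2/q) - 1."""
    return sum(pi * pi / qi for pi, qi in zip(p, q)) - 1.0

# ASSUMPTION: toy logits and drift direction, chosen only for illustration.
base_logits = [2.0, 1.0, 0.0, -1.0]   # stale sampler policy q
drift_dir = [-1.0, 0.0, 0.5, 1.0]     # direction in which the learner has moved
q = softmax(base_logits)

def drifted_policy(t):
    """Learner policy after drifting t units away from the sampler (hypothetical)."""
    return softmax([b + t * d for b, d in zip(base_logits, drift_dir)])

rows = []
for t in [0.0, 0.5, 1.0, 2.0, 3.0]:
    p = drifted_policy(t)
    rows.append((t, kl(p, q), iw_variance(p, q)))

for t, d_kl, var_w in rows:
    print(f"drift={t:.1f}  KL(p||q)={d_kl:.4f}  Var_q[p/q]={var_w:.4f}")
```

Here the drift parameter stands in for staleness: a larger delay between sampling and learning means a larger gap between p and q, and the variance of the plain ratio p/q grows much faster than the KL itself.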

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 6·Confidence 2

Strengths

**Clear systems problem + algorithmic handle**. The paper crisply diagnoses policy staleness in decentralized RL and ties it to KL growth → variance blow-ups; GEPO’s denominator uses a within-group expectation to damp variance in precisely that regime.

**Compelling empirical stability**. Under HeteroRL with max delay 64, GEPO improves best accuracy vs. GRPO/GSPO and, crucially, reduces best-to-last degradation by ~85% vs. GSPO (Δ=1.8 vs. 12.0). Curves show lower IW variance and smoother gradie
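To make the "within-group expectation in the denominator" idea concrete, here is a minimal sketch. It is an assumption-laden illustration, not the paper's implementation: the function names are hypothetical, the probabilities are made up, and the specific group estimate used (a q-weighted average of the sampler probabilities, Σ q_i² / Σ q_i) is one plausible reading that should be checked against the paper's own definition. Real systems would also work with sequence log-probabilities rather than raw probabilities.

```python
import statistics

def standard_weights(p_probs, q_probs):
    """Per-sample importance weights w_i = p(y_i|x) / q(y_i|x)."""
    return [p / q for p, q in zip(p_probs, q_probs)]

def group_expected_q(q_probs):
    """ASSUMPTION: group estimate of E_q[q(y|x)] as a q-weighted average of q."""
    return sum(q * q for q in q_probs) / sum(q_probs)

def group_expectation_weights(p_probs, q_probs):
    """Weights that share one group-level denominator instead of each sample's own q."""
    denom = group_expected_q(q_probs)
    return [p / denom for p in p_probs]

# Hypothetical group of G = 4 responses to the same prompt x.
q_probs = [0.5, 0.1, 0.01, 0.001]   # probabilities under the stale sampler policy
p_probs = [0.4, 0.12, 0.02, 0.01]   # probabilities under the current learner policy

std_w = standard_weights(p_probs, q_probs)
grp_w = group_expectation_weights(p_probs, q_probs)

std_var = statistics.pvariance(std_w)
grp_var = statistics.pvariance(grp_w)

print("standard weights:", [round(w, 3) for w in std_w])
print("group weights:   ", [round(w, 3) for w in grp_w])
print("variance standard vs. group:", round(std_var, 3), round(grp_var, 3))
```

In this toy group, the rare response with q = 0.001 produces a standard weight of about 10 that dominates the variance. With a shared denominator, no single small q can inflate a weight, at the cost of the bias that the reviewers discuss.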

Weaknesses

**Bias–variance trade-off left under-quantified in RL objective**. GEPO’s estimator is acknowledged as biased; while lower variance can help optimization, the paper does not quantify end-to-end bias in policy gradients or returns beyond variance plots. A small-bias claim would benefit from controlled ablations where true on-policy gradients are approximated (short-horizon toy MDPs) to measure bias vs. sample efficiency. (GEIW is described as biased but stable.)

**External validity beyond math-r

Reviewer 02·Rating 4·Confidence 5

Strengths

1. The visualizations illustrate the main claims of the paper.
2. The paper targets an important bottleneck in large-scale distributed reinforcement learning.

Weaknesses

1. The paper lacks experimental comparisons with other asynchronous policy optimization methods [1, 2, 3, 4].
2. Equation (1) appears very similar to PPO, except that the clipping function is removed.
3. In line 139, the variable $G$ is undefined; in Equation (1), it is unclear how $A(x)$ is computed, and in line 146, the input of $p$ is not specified.
4. While Theorem 1 demonstrates a reduction in the variance of the importance sampling coefficient, this result does not guarantee a correspondin

Reviewer 03·Rating 6·Confidence 2

Strengths

1. The idea of reducing the variance of importance weights through group expectation weighting is conceptually elegant and practically powerful. It addresses the instability issue caused by large KL divergence in asynchronous or heterogeneous RL settings.
2. The motivation (instability and variance explosion in asynchronous or heterogeneous RL due to policy staleness) is well validated by experimental results. The results across both online and heterogeneous RL settings consistently demonstrate

Weaknesses

1. The experimental validation appears limited in scope. GEPO is only evaluated on mathematical reasoning datasets (MATH, AIME, AMC) and with relatively small models (up to 8B parameters).
2. There seems to be an inconsistency between the formulation and the theoretical analysis. In Section 3.1, the paper explicitly states that "the vector $(q(y_1|x), …, q(y_G|x))$ does not constitute a valid probability distribution" since top-K/top-P sampling leads to $\sum_i q(y_i|x) \gg 1$. However, in Theorem 1 a

Taxonomy

Topics·Reinforcement Learning in Robotics