GEPO: Group Expectation Policy Optimization for Stable Heterogeneous Reinforcement Learning
Han Zhang, Ruibin Zheng, Zexuan Yi, Zhuo Zhang, Hanyang Peng, Hui Wang, Zike Yuan, Cai Ke, Shiwei Chen, Jiacheng Yang, Yangning Li, Xiang Li, Jiangyue Yan, Yaoqi Liu, Liwen Jing, Jiayin Qi, Ruifeng Xu, Binxing Fang, Yue Yu

TL;DR
This paper introduces GEPO, an asynchronous reinforcement learning algorithm designed for stable, decentralized training across geographically distributed nodes with heterogeneous resources. It reduces the variance of importance weights and thereby improves training stability.
Contribution
The paper proposes GEPO, a new group expectation policy optimization method that decouples parameter learning from rollout sampling, enabling robust decentralized RL training with theoretical variance reduction guarantees.
Findings
GEPO maintains high performance, with only a 3% drop under 1800 s of latency.
GEPO reduces the best-to-last gap by 85% compared to GSPO.
GEPO achieves the highest scores in decentralized, resource-heterogeneous environments.
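As a quick consistency check, the 85% figure agrees with the best-to-last gaps quoted in the peer reviews (Δ = 1.8 for GEPO vs. 12.0 for GSPO). The symbols $\Delta_{\mathrm{GEPO}}$ and $\Delta_{\mathrm{GSPO}}$ are shorthand introduced here for illustration, not notation taken from the paper:

```latex
\frac{\Delta_{\mathrm{GSPO}} - \Delta_{\mathrm{GEPO}}}{\Delta_{\mathrm{GSPO}}}
  = \frac{12.0 - 1.8}{12.0}
  = \frac{10.2}{12.0}
  = 0.85
```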
Abstract
As single-center computing approaches power constraints, decentralized training becomes essential. However, traditional Reinforcement Learning (RL) methods, crucial for enhancing large model post-training, cannot adapt to decentralized distributed training due to the tight coupling between parameter learning and rollout sampling. For this, we propose HeteroRL, a heterogeneous RL architecture that decouples these processes, enabling stable training across geographically distributed nodes connected via the Internet. The core component is Group Expectation Policy Optimization (GEPO), an asynchronous RL algorithm robust to latency caused by network delays or heterogeneity in computational resources. Our study reveals that high latency significantly increases KL divergence, leading to higher variance of importance weights and training instability. GEPO mitigates this issue by using group…
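To make the idea of a group-level denominator concrete, here is a minimal sketch in pure Python. It contrasts the standard per-sample importance weight p(y_i|x) / q(y_i|x) with a weight whose denominator is shared across the group. The specific estimator used here, a q-weighted average of the sampler probabilities, Σ q_i² / Σ q_i, is an assumption made for illustration; the page only says that GEPO's denominator uses a within-group expectation. The function names are hypothetical and are not taken from the authors' code.

```python
import math


def _logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def standard_iw(logp, logq):
    """Per-sample importance weights p(y_i|x) / q(y_i|x), from log-probs."""
    return [math.exp(lp - lq) for lp, lq in zip(logp, logq)]


def group_expectation_iw(logp, logq):
    """Weights p(y_i|x) / E_hat, with one denominator shared by the group.

    ASSUMPTION (illustrative, not confirmed by the page):
    E_hat = sum_i q_i^2 / sum_i q_i, computed in log-space.
    """
    log_denom = _logsumexp([2.0 * lq for lq in logq]) - _logsumexp(logq)
    return [math.exp(lp - log_denom) for lp in logp]


# Toy group of G = 3 responses (made-up probabilities).
q = [0.5, 0.25, 0.25]   # sampler (possibly stale) policy
p = [0.4, 0.4, 0.2]     # current learner policy
w_std = standard_iw([math.log(v) for v in p], [math.log(v) for v in q])
w_grp = group_expectation_iw([math.log(v) for v in p], [math.log(v) for v in q])

# A stale sampler that assigned a tiny probability to the first response.
q_stale = [1e-6, 0.5, 0.5]
p_now = [0.3, 0.3, 0.3]
w_std_stale = standard_iw([math.log(v) for v in p_now],
                          [math.log(v) for v in q_stale])
w_grp_stale = group_expectation_iw([math.log(v) for v in p_now],
                                   [math.log(v) for v in q_stale])
print(max(w_std_stale), max(w_grp_stale))
```

In the stale case, the per-sample ratio is dominated by the single response with a tiny q, while the shared denominator keeps every weight on the same scale. This sketch only illustrates the mechanism; it says nothing about the bias of the resulting gradient estimate, which the reviews raise as an open question.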
Peer Reviews
Decision·ICLR 2026 Poster
**Clear systems problem + algorithmic handle**. The paper crisply diagnoses policy staleness in decentralized RL and ties it to KL growth → variance blow-ups; GEPO’s denominator uses a within-group expectation to damp variance in precisely that regime. **Compelling empirical stability**. Under Hetero RL with max delay 64, GEPO improves best accuracy vs. GRPO/GSPO and, crucially, reduces best-to-last degradation by ~85% vs. GSPO (Δ=1.8 vs. 12.0). Curves show lower IW variance and smoother gradie
**Bias–variance trade-off left under-quantified in RL objective**. GEPO’s estimator is acknowledged as biased; while lower variance can help optimization, the paper does not quantify end-to-end bias in policy gradients or returns beyond variance plots. A small-bias claim would benefit from controlled ablations where true on-policy gradients are approximated (short-horizon toy MDPs) to measure bias vs. sample efficiency. (GEIW is described as biased but stable.) **External validity beyond math-r
1. The visualizations illustrate the main claims of the paper.
2. The paper targets an important bottleneck in large-scale distributed reinforcement learning.
1. The paper lacks experimental comparisons with other asynchronous policy optimization methods [1, 2, 3, 4].
2. Equation (1) appears very similar to PPO, except that the clipping function is removed.
3. In line 139, the variable $G$ is undefined; in Equation (1), it is unclear how $A(x)$ is computed, and in line 146, the input of $p$ is not specified.
4. While Theorem 1 demonstrates a reduction in the variance of the importance sampling coefficient, this result does not guarantee a correspondin
1. The idea of reducing the variance of importance weights through group expectation weighting is conceptually elegant and practically powerful. It addresses the instability caused by large KL divergence in asynchronous or heterogeneous RL settings.
2. The motivation (instability and variance explosion in asynchronous or heterogeneous RL due to policy staleness) is well validated by experimental results. The results across both online and heterogeneous RL settings consistently demonstrate
1. The experimental validation appears limited in scope. GEPO is only evaluated on mathematical reasoning datasets (MATH, AIME, AMC) and with relatively small models (up to 8B parameters).
2. There seems to be an inconsistency between the formulation and the theoretical analysis. In Section 3.1, the paper explicitly states that "the vector $(q(y_1|x), \dots, q(y_G|x))$ does not constitute a valid probability distribution" since top-K/top-P sampling leads to $\sum_i q(y_i|x) \gg 1$. However, in Theorem 1 a
Taxonomy
Topics: Reinforcement Learning in Robotics
