GAE Falls Short in Imperfect-Information Self-Play Reinforcement Learning
Zhiyuan Fan, Gabriele Farina

TL;DR
This paper introduces VRPO, a self-play reinforcement learning algorithm for imperfect-information games whose variance-reduced advantage estimator improves stability and performance over PPO with generalized advantage estimation (GAE).
Contribution
The paper proposes Q-boosting, a variance-reduced advantage estimator based on a centralized action-value critic, and VRPO, a policy optimization algorithm built on that estimator, to stabilize learning in multi-agent imperfect-information settings.
Findings
VRPO outperforms PPO in complex imperfect-information games.
Q-boosting effectively reduces advantage estimation variance.
VRPO achieves strong results in Dou Dizhu and Heads-Up No-Limit Texas Hold'em.
Abstract
Competitive multi-agent reinforcement learning in imperfect-information games requires agents to act under partial observability and against adversarial opponents, necessitating stochastic policies. While self-play reinforcement learning with Proximal Policy Optimization (PPO) has achieved strong empirical success, its standard advantage estimator, generalized advantage estimation, suffers from additional variance due to the sampling of stochastic future actions. This variance is amplified in equilibrium self-play because of the stochastic nature of the equilibrium policy and persists even when the critic is exact. We address this bottleneck by introducing Q-boosting, a variance-reduced advantage estimator based on a centralized action-value critic, and propose Variance-Reduced Policy Optimization (VRPO), incorporating this new estimator. The algorithm replaces sampled multi-step…
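The claim that GAE's variance persists even with an exact critic can be checked on a toy problem. The sketch below is an illustration only, not the paper's Q-boosting estimator or its experiments. It assumes a hypothetical two-step deterministic game with a uniform policy, terminal rewards `R[a][b]`, no discounting, and exact values computed by enumeration. All names (`R`, `PI`, `gae_estimates`, `q_based_advantage`) are invented for the example.

```python
from statistics import fmean, pvariance

GAMMA = 1.0                      # assumed: no discounting
R = [[1.0, -1.0], [0.5, 0.5]]    # hypothetical terminal reward R[a][b]
PI = [0.5, 0.5]                  # assumed uniform policy at every state


def v_s1(a):
    """Exact state value after first action a (expectation over b)."""
    return sum(p * r for p, r in zip(PI, R[a]))


def q_s0(a):
    """Exact action value at the root; the first step pays no reward."""
    return GAMMA * v_s1(a)


def v_s0():
    """Exact root state value (expectation over a)."""
    return sum(p * q_s0(a) for a, p in enumerate(PI))


def gae_estimates(a, lam):
    """GAE(lambda) estimates of A(s0, a), one per possible follow-up action b.

    The critic is exact, so any spread across b comes only from
    sampling the stochastic future action.
    """
    delta0 = 0.0 + GAMMA * v_s1(a) - v_s0()   # TD residual at step 0
    estimates = []
    for b in range(len(PI)):
        delta1 = R[a][b] - v_s1(a)            # TD residual at the terminal step
        estimates.append(delta0 + GAMMA * lam * delta1)
    return estimates


def q_based_advantage(a):
    """Advantage read directly from exact action values: no sampling involved."""
    return q_s0(a) - v_s0()


if __name__ == "__main__":
    for lam in (0.0, 1.0):
        est = gae_estimates(0, lam)
        print(f"lambda={lam}: mean={fmean(est)}, variance={pvariance(est)}")
    print("Q-based advantage:", q_based_advantage(0))
```

In this toy setting, with λ = 1 the two equally likely GAE estimates for the first action are 0.75 and -1.25. Their mean equals the true advantage -0.25, but their variance is 1.0, caused solely by sampling the follow-up action. With λ = 0, or when the advantage is read from exact action values, the estimate is -0.25 with zero variance.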
