Interpolating Between Softmax Policy Gradient and Neural Replicator Dynamics with Capped Implicit Exploration

Dustin Morrill; Esra'a Saleh; Michael Bowling; Amy Greenwald

arXiv:2206.02036·cs.LG·June 7, 2022

Open Access

TL;DR

This paper introduces NeuRD-CIX, a new algorithm that interpolates between softmax policy gradient and neural replicator dynamics using capped implicit exploration, improving stability and performance in sequential decision-making.

Contribution

The paper proposes CIX estimates to interpolate between NeuRD and SPG, and provides high-probability regret bounds and a bias-variance tradeoff analysis.

Findings

1. NeuRD-CIX outperforms NeuRD in non-stationary environments.

2. CIX estimates enable high-probability regret bounds.

3. NeuRD-CIX balances bias and variance effectively.

Abstract

Neural replicator dynamics (NeuRD) is an alternative to the foundational softmax policy gradient (SPG) algorithm motivated by online learning and evolutionary game theory. The NeuRD expected update is designed to be nearly identical to that of SPG; however, we show that the Monte Carlo updates differ in a substantial way: the importance correction accounting for a sampled action is nullified in the SPG update, but not in the NeuRD update. Naturally, this causes the NeuRD update to have higher variance than its SPG counterpart. Building on implicit exploration algorithms in the adversarial bandit setting, we introduce capped implicit exploration (CIX) estimates that allow us to construct NeuRD-CIX, which interpolates between this aspect of NeuRD and SPG. We show how CIX estimates can be used in a black-box reduction to construct bandit algorithms with regret bounds that hold with high…
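
The difference between the two Monte Carlo updates described in the abstract can be illustrated in the simplest setting. The sketch below is not taken from the paper: it assumes a single-state bandit, a tabular softmax policy over logits, and the standard importance-weighted reward estimate (the sampled action's reward divided by its probability, zero elsewhere). All function names are hypothetical. The CIX estimate itself is not defined in the text above, so it is deliberately left out.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def importance_weighted_rewards(policy, action, reward):
    """Assumed estimator: reward / pi(action) on the sampled action, 0 elsewhere."""
    r_hat = np.zeros_like(policy)
    r_hat[action] = reward / policy[action]
    return r_hat

def spg_update(policy, action, reward):
    """Monte Carlo SPG logit update: pi_a * (r_hat_a - pi . r_hat).

    The 1 / pi(action) factor cancels, leaving reward * (onehot - policy).
    """
    r_hat = importance_weighted_rewards(policy, action, reward)
    return policy * (r_hat - policy @ r_hat)

def neurd_update(policy, action, reward):
    """Monte Carlo NeuRD logit update: r_hat_a - pi . r_hat.

    The 1 / pi(action) factor survives on the sampled action's entry.
    """
    r_hat = importance_weighted_rewards(policy, action, reward)
    return r_hat - policy @ r_hat

def expected_update(update_fn, policy, rewards):
    """Exact expectation of an update over actions sampled from the policy."""
    return sum(policy[a] * update_fn(policy, a, rewards[a])
               for a in range(len(policy)))

if __name__ == "__main__":
    policy = softmax(np.array([2.0, 0.0, -3.0]))
    rare_action, reward = 2, 1.0  # a low-probability action is sampled
    print("SPG update:  ", spg_update(policy, rare_action, reward))
    print("NeuRD update:", neurd_update(policy, rare_action, reward))
```

Under these assumptions, the SPG entries stay bounded by the reward magnitude, while the NeuRD entry for the sampled action grows like 1 / pi(action) as that action becomes rare, which is the source of the higher variance. In expectation, the NeuRD update equals the advantage vector and the SPG update equals the advantage scaled elementwise by the policy.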


Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Stochastic Gradient Optimization Techniques

Methods: Softmax