Interpolating Between Softmax Policy Gradient and Neural Replicator Dynamics with Capped Implicit Exploration
Dustin Morrill, Esra'a Saleh, Michael Bowling, Amy Greenwald

TL;DR
This paper introduces NeuRD-CIX, a new algorithm that interpolates between softmax policy gradient and neural replicator dynamics using capped implicit exploration, improving stability and performance in sequential decision-making.
Contribution
The paper proposes capped implicit exploration (CIX) estimates that interpolate between the Monte Carlo updates of NeuRD and SPG, and it provides high-probability regret bounds together with an analysis of the resulting bias-variance tradeoff.
Findings
NeuRD-CIX outperforms NeuRD in non-stationary environments.
CIX estimates enable high-probability regret bounds.
NeuRD-CIX balances bias and variance effectively.
Abstract
Neural replicator dynamics (NeuRD) is an alternative to the foundational softmax policy gradient (SPG) algorithm motivated by online learning and evolutionary game theory. The NeuRD expected update is designed to be nearly identical to that of SPG; however, we show that the Monte Carlo updates differ in a substantial way: the importance correction accounting for a sampled action is nullified in the SPG update, but not in the NeuRD update. Naturally, this causes the NeuRD update to have higher variance than its SPG counterpart. Building on implicit exploration algorithms in the adversarial bandit setting, we introduce capped implicit exploration (CIX) estimates that allow us to construct NeuRD-CIX, which interpolates between this aspect of NeuRD and SPG. We show how CIX estimates can be used in a black-box reduction to construct bandit algorithms with regret bounds that hold with high…
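The difference between the two Monte Carlo updates can be illustrated with a small sketch. It assumes a tabular, single-state (bandit) setting with a softmax policy over logits and the standard importance-weighted reward estimate; the function names are hypothetical and are not taken from the paper, and the CIX estimate itself is not reproduced here because the abstract does not define it. Under these assumptions, the SPG update multiplies the estimate by the policy, which cancels the 1/π(A) factor, while the NeuRD update keeps it.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def importance_weighted_reward(policy, action, reward):
    """Standard importance-weighted estimate: reward / pi(A) on the sampled action, 0 elsewhere."""
    estimate = np.zeros_like(policy)
    estimate[action] = reward / policy[action]
    return estimate


def spg_update(policy, action, reward):
    """Assumed tabular SPG logit update: pi(a) * (r_hat(a) - <pi, r_hat>).

    Multiplying by pi(a) cancels the 1 / pi(A) importance weight.
    """
    r_hat = importance_weighted_reward(policy, action, reward)
    return policy * (r_hat - policy @ r_hat)


def neurd_update(policy, action, reward):
    """Assumed tabular NeuRD logit update: r_hat(a) - <pi, r_hat>.

    The 1 / pi(A) importance weight is retained.
    """
    r_hat = importance_weighted_reward(policy, action, reward)
    return r_hat - policy @ r_hat


# Illustrative values: a policy that rarely plays action 2, and a reward of 1.
policy = softmax(np.array([2.0, 0.0, -2.0]))
rare_action = 2
reward = 1.0

spg = spg_update(policy, rare_action, reward)
neurd = neurd_update(policy, rare_action, reward)

print("policy:", policy)
print("SPG update:  ", spg)    # equals (one_hot(A) - pi) * reward
print("NeuRD update:", neurd)  # entry for A scales with 1 / pi(A)
```

In this sketch the SPG update stays bounded by the reward magnitude, whereas the NeuRD entry for the sampled action grows as its probability shrinks, which is consistent with the higher variance the abstract describes.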
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Stochastic Gradient Optimization Techniques
Methods: Softmax
