Local and adaptive mirror descents in extensive-form games
Côme Fiegel, Pierre Ménard, Tadashi Kozuno, Rémi Munos, Vianney Perchet, Michal Valko

TL;DR
This paper introduces an adaptive online mirror descent algorithm for learning near-optimal strategies in zero-sum imperfect information games, reducing variance and improving convergence rates through fixed sampling and local updates.
Contribution
It proposes a novel adaptive OMD approach with local updates and decreasing learning rates, achieving near-optimal convergence in extensive-form games with fixed sampling.
Findings
Guarantees a convergence rate of $\tilde{\mathcal{O}}(T^{-1/2})$ with high probability, where $T$ is the number of episodes.
Achieves near-optimal dependence on game parameters with optimal learning rates and sampling policies.
Generalizes OMD stabilization to include time-varying regularization.
Abstract
We study how to learn $\epsilon$-optimal strategies in zero-sum imperfect information games (IIG) with trajectory feedback. In this setting, players update their policies sequentially based on their observations over a fixed number of episodes, denoted by $T$. Existing procedures suffer from high variance due to the use of importance sampling over sequences of actions (Steinberger et al., 2020; McAleer et al., 2022). To reduce this variance, we consider a fixed sampling approach, where players still update their policies over time, but with observations obtained through a given fixed sampling policy. Our approach is based on an adaptive Online Mirror Descent (OMD) algorithm that applies OMD locally to each information set, using individually decreasing learning rates and a regularized loss. We show that this approach guarantees a convergence rate of $\tilde{\mathcal{O}}(T^{-1/2})$ with…
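As a rough illustration of the two ideas named in the abstract (a fixed sampling policy and a local OMD update with a decreasing learning rate), the following sketch runs an entropic OMD (exponentiated-gradient) update at a single information set. It is a simplified toy, not the paper's algorithm: the function names, the Bernoulli loss model, the `base_rate / sqrt(t)` schedule, and the restriction to one information set are all assumptions made for illustration, and the regularized loss and the game-tree structure are omitted.

```python
import math
import random


def omd_step(policy, loss_estimate, learning_rate):
    """One entropic OMD (exponentiated-gradient) step on the probability simplex."""
    weights = [p * math.exp(-learning_rate * l) for p, l in zip(policy, loss_estimate)]
    total = sum(weights)
    return [w / total for w in weights]


def run_local_omd(true_losses, sampling_policy, episodes, base_rate=0.5, seed=0):
    """Toy loop at ONE information set (hypothetical helper, not from the paper).

    Actions are always drawn from the fixed `sampling_policy`; the learned
    `policy` is updated but never used to act.
    """
    rng = random.Random(seed)
    n_actions = len(true_losses)
    policy = [1.0 / n_actions] * n_actions
    for t in range(1, episodes + 1):
        # Sample the action from the fixed sampling policy.
        action = rng.choices(range(n_actions), weights=sampling_policy)[0]
        # Assumed loss model: Bernoulli loss with mean true_losses[action].
        loss = 1.0 if rng.random() < true_losses[action] else 0.0
        # Importance-weighted estimate; the weight depends only on the fixed
        # sampling policy, so it does not grow as the learned policy sharpens.
        estimate = [0.0] * n_actions
        estimate[action] = loss / sampling_policy[action]
        # Assumed decreasing learning-rate schedule.
        learning_rate = base_rate / math.sqrt(t)
        policy = omd_step(policy, estimate, learning_rate)
    return policy


if __name__ == "__main__":
    learned = run_local_omd([0.9, 0.1, 0.5], [1 / 3, 1 / 3, 1 / 3], episodes=2000)
    print(learned)
```

Because the importance weight is `1 / sampling_policy[action]` rather than the inverse of the current policy's probability, the loss estimates stay bounded whenever the sampling policy keeps every action's probability away from zero, which is the intuition behind the variance reduction described above.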
Peer Reviews
Decision: NeurIPS 2024 poster
The paper makes a strong technical contribution: an adaptive OMD algorithm for trajectory feedback that does not rely on importance sampling, and a generalization of DS-OMD to time-varying regularization. The paper provides empirical evidence for the convergence and variance of its approach compared to other approaches in the literature, and also provides code for the algorithm.
The paper could delineate its contributions better. As noted by most reviewers in a previous submission, the similarity to the regret circuit decomposition of CFR is apparent (and noted by the authors). Still, the difference could be further highlighted in the contributions, especially since this was confusing for several reviewers last time. A note was made by the authors last time regarding the interpretation of their method as regularization at the global level (whereas CFR doesn't have this
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms
