TL;DR
This paper investigates last-iterate convergence in zero-sum games with bandit feedback, establishing fundamental limits and proposing algorithms that achieve near-optimal convergence rates without communication.
Contribution
It identifies the optimal convergence rate for uncoupled algorithms in bandit settings and introduces two algorithms that attain this rate up to logarithmic factors.
Findings
The best achievable last-iterate convergence rate in this setting is O(T^{-1/4}).
The proposed algorithms match this rate up to constant and logarithmic factors.
These guarantees hold without any communication between the players.
Abstract
We study the problem of learning in zero-sum matrix games with repeated play and bandit feedback. Specifically, we focus on developing uncoupled algorithms that guarantee, without communication between players, the convergence of the last-iterate to a Nash equilibrium. Although the non-bandit case has been studied extensively, this setting has only been explored recently, with a previously established bound on the exploitability gap. We show that, for uncoupled algorithms, guaranteeing convergence of the policy profiles to a Nash equilibrium is detrimental to the performance, with the best attainable rate being O(T^{-1/4}), in contrast to the usual, faster rate for convergence of the average iterates. We then propose two algorithms that achieve this optimal rate up to constant and logarithmic factors. The first algorithm leverages a straightforward trade-off between…
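To make the quantities in the abstract concrete, the sketch below computes the exploitability gap of a policy profile in a zero-sum matrix game and simulates one round of bandit feedback. It is an illustration only, not the paper's algorithm. The function names, the convention that the row player maximizes x^T A y, and the noise-free payoff are assumptions introduced here.

```python
import numpy as np


def exploitability_gap(A, x, y):
    """Exploitability (duality) gap of the profile (x, y).

    Assumed convention: the row player maximizes x^T A y and the
    column player minimizes it. The gap is zero exactly at a Nash
    equilibrium and positive otherwise.
    """
    A = np.asarray(A, dtype=float)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    best_row_value = np.max(A @ y)  # row player's best-response payoff against y
    best_col_value = np.min(x @ A)  # payoff when the column player best-responds to x
    return float(best_row_value - best_col_value)


def bandit_round(A, x, y, rng):
    """One round of play with bandit feedback (hypothetical helper).

    Each player samples an action from its own policy and observes only
    its own realized payoff, never the matrix A or the opponent's policy.
    """
    i = rng.choice(len(x), p=x)
    j = rng.choice(len(y), p=y)
    payoff = float(A[i, j])
    return (i, payoff), (j, -payoff)


# Matching pennies: the unique Nash equilibrium is uniform play by both players.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
uniform = np.array([0.5, 0.5])
pure = np.array([1.0, 0.0])

gap_at_equilibrium = exploitability_gap(A, uniform, uniform)
gap_pure = exploitability_gap(A, pure, pure)

rng = np.random.default_rng(0)
row_feedback, col_feedback = bandit_round(A, uniform, uniform, rng)
```

Here `gap_at_equilibrium` is 0 and `gap_pure` is 2. A last-iterate guarantee bounds this gap for the policy profile actually played at round T, rather than for the average of the profiles played up to round T.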