Instance-Dependent Regret Bounds for Learning Two-Player Zero-Sum Games with Bandit Feedback
Shinji Ito, Haipeng Luo, Taira Tsuchiya, Yue Wu

TL;DR
This paper demonstrates that in two-player zero-sum games with bandit feedback, players using the Tsallis-INF algorithm can achieve accelerated regret bounds and convergence, especially when a pure strategy Nash equilibrium exists.
Contribution
The paper provides the first analysis of accelerated regret bounds under bandit feedback for two-player zero-sum normal-form games, showing improved, instance-dependent bounds and convergence guarantees.
Findings
Regret bound of O(c_1 log T + sqrt(c_2 T)) with bandit feedback
Optimal regret bound when a pure strategy Nash equilibrium exists
Algorithm achieves last-iterate convergence and near-optimal sample complexity
Abstract
No-regret self-play learning dynamics have become one of the premier ways to solve large-scale games in practice. Accelerating their convergence via improving the regret of the players over the naive O(sqrt(T)) bound after T rounds has been extensively studied in recent years, but almost all studies assume access to exact gradient feedback. We address the question of whether acceleration is possible under bandit feedback only and provide an affirmative answer for two-player zero-sum normal-form games. Specifically, we show that if both players apply the Tsallis-INF algorithm of Zimmert and Seldin (2018, arXiv:1807.07623), then their regret is at most O(c_1 log T + sqrt(c_2 T)), where c_1 and c_2 are game-dependent constants that characterize the difficulty of learning -- c_1 resembles the complexity of learning a stochastic multi-armed bandit instance and depends…
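To make the setting in the abstract concrete, the following is a minimal Python sketch of two players running a Tsallis-INF-style update against each other, each seeing only the loss of the action pair actually played. It is an illustration under stated assumptions, not the authors' implementation: the function names `tsallis_inf_distribution` and `self_play`, the learning-rate schedule eta_t = 2/sqrt(t), the importance-weighted loss estimator, the noise-free feedback, and the bisection solver for the normalizer are all choices made here for the example.

```python
import numpy as np


def tsallis_inf_distribution(cum_loss, eta):
    """Minimize <w, cum_loss> - (4/eta) * sum(sqrt(w)) over the simplex.

    Stationarity gives w_i = 4 / (eta * (cum_loss_i - z))**2 for a
    normalizer z < min(cum_loss); z is found here by bisection.
    """
    cum_loss = np.asarray(cum_loss, dtype=float)
    k = cum_loss.size
    lo = cum_loss.min() - 2.0 * np.sqrt(k) / eta  # total mass <= 1 at this z
    hi = cum_loss.min() - 2.0 / eta               # total mass >= 1 at this z
    for _ in range(100):
        z = 0.5 * (lo + hi)
        mass = np.sum(4.0 / (eta * (cum_loss - z)) ** 2)
        if mass > 1.0:
            hi = z
        else:
            lo = z
    w = 4.0 / (eta * (cum_loss - 0.5 * (lo + hi))) ** 2
    return w / w.sum()  # renormalize to absorb bisection error


def self_play(A, T, seed=0):
    """Both players run the update above with bandit feedback only.

    A[i, j] in [0, 1] is the row player's loss; the column player's
    loss is 1 - A[i, j]. Feedback is noise-free here (an assumption).
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    loss_x, loss_y = np.zeros(m), np.zeros(n)  # cumulative loss estimates
    for t in range(1, T + 1):
        eta = 2.0 / np.sqrt(t)  # assumed schedule
        x = tsallis_inf_distribution(loss_x, eta)
        y = tsallis_inf_distribution(loss_y, eta)
        i = rng.choice(m, p=x)
        j = rng.choice(n, p=y)
        # Each player observes one scalar and importance-weights it.
        loss_x[i] += A[i, j] / x[i]
        loss_y[j] += (1.0 - A[i, j]) / y[j]
    return x, y  # last iterates


if __name__ == "__main__":
    # Example game with a pure-strategy Nash equilibrium at (row 0, column 0).
    A = np.array([[0.5, 0.2], [0.8, 0.6]])
    x, y = self_play(A, T=5000)
    print("row player's last iterate:", x)
    print("column player's last iterate:", y)
```

In the example game, entry (0, 0) is the smallest in its column and the largest in its row, so it is a pure-strategy equilibrium when the row player minimizes and the column player maximizes. The script prints the last iterates so that their concentration on that pair can be inspected by hand; the sketch itself makes no claim about how fast this happens.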
Taxonomy
Topics: Advanced Bandit Algorithms Research · Auction Theory and Applications · Data Stream Mining Techniques
