Feature-Based Q-Learning for Two-Player Stochastic Games
Zeyu Jia, Lin F. Yang, Mengdi Wang

TL;DR
This paper introduces a feature-based Q-learning algorithm for two-player stochastic games that efficiently approximates Nash equilibria with sample complexity independent of the game's original dimensions.
Contribution
It proposes a novel two-player Q-learning method leveraging feature embeddings and develops an accelerated version with proven sample efficiency guarantees.
Findings
The basic algorithm finds an $\epsilon$-optimal strategy using a number of samples linear in the number of features.
The accelerated algorithm achieves $\epsilon$-optimality with $\tilde{O}(K/(\epsilon^{2}(1-\gamma)^{4}))$ samples.
Sample, time, and space complexities are independent of the original game dimensions.
Abstract
Consider a two-player zero-sum stochastic game where the transition function can be embedded in a given feature space. We propose a two-player Q-learning algorithm for approximating the Nash equilibrium strategy via sampling. The algorithm is shown to find an $\epsilon$-optimal strategy using a sample size linear in the number of features. To further improve its sample efficiency, we develop an accelerated algorithm by adopting techniques such as variance reduction, monotonicity preservation, and two-sided strategy approximation. We prove that the algorithm is guaranteed to find an $\epsilon$-optimal strategy using no more than $\tilde{O}(K/(\epsilon^{2}(1-\gamma)^{4}))$ samples with high probability, where $K$ is the number of features and $\gamma$ is a discount factor. The sample, time, and space complexities of the algorithm are independent of the original dimensions of the game.
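To get a feel for how the accelerated bound scales, the following is a minimal sketch that evaluates only the leading term $K/(\epsilon^{2}(1-\gamma)^{4})$, where $K$ is the number of features, $\epsilon$ the target accuracy, and $\gamma$ the discount factor. The function name `sample_bound` is hypothetical, and the constants and logarithmic factors hidden by the $\tilde{O}$ notation are dropped, so the result is an order-of-magnitude illustration rather than an actual sample count from the paper.

```python
def sample_bound(num_features: int, epsilon: float, gamma: float) -> float:
    """Leading-order term K / (eps^2 * (1 - gamma)^4).

    Assumption: constants and log factors hidden by the O-tilde
    notation are ignored, so this is only a scaling illustration.
    """
    if num_features <= 0:
        raise ValueError("num_features must be positive")
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    return num_features / (epsilon ** 2 * (1.0 - gamma) ** 4)


if __name__ == "__main__":
    # The bound takes no state-space or action-space size as input:
    # only K, epsilon, and gamma appear.
    for gamma in (0.5, 0.9, 0.99):
        print(gamma, sample_bound(num_features=10, epsilon=0.1, gamma=gamma))
```

Note that the leading term grows linearly in $K$, quadratically in $1/\epsilon$, and with the fourth power of $1/(1-\gamma)$, so the discount factor dominates the cost for long effective horizons.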
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Adaptive Dynamic Programming Control
Methods: Q-Learning
