Provable Memory Efficient Self-Play Algorithm for Model-free Reinforcement Learning
Na Li, Yuchen Jiao, Hangguan Shan, Shefeng Yan

TL;DR
This paper introduces ME-Nash-QL, a memory-efficient, model-free self-play algorithm for two-player zero-sum Markov games that improves space, sample, and burn-in costs while maintaining Markov policies.
Contribution
The paper proposes ME-Nash-QL, the first algorithm with provable memory efficiency and improved complexity bounds for model-free MARL in two-player zero-sum Markov games.
Findings
Achieves space complexity $O(SABH)$ for an $\epsilon$-approximate Nash policy.
Reduces sample complexity to $\tilde{O}(H^4SAB/\epsilon^2)$ for long horizons.
Lowers burn-in cost to $O(SAB\,\mathrm{poly}(H))$ compared to previous methods.
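To get a feel for these bounds, the sketch below plugs hypothetical problem sizes into the stated expressions, ignoring constants and logarithmic factors. The function names, the example sizes, and the $S^2ABH$ comparison point (the size of a tabular transition model that a model-based method would need to store) are illustrative assumptions, not figures taken from the paper.

```python
def q_table_entries(S: int, A: int, B: int, H: int) -> int:
    """Entries in one tabular Q-function indexed by (step, state, action pair).

    This is the S*A*B*H quantity behind the O(SABH) space bound.
    """
    return S * A * B * H


def transition_model_entries(S: int, A: int, B: int, H: int) -> int:
    """Entries in a tabular transition model P_h(s' | s, a, b).

    Assumption for comparison only: a model-based method stores one
    probability per (step, state, action pair, next state).
    """
    return S * S * A * B * H


def sample_complexity_order(S: int, A: int, B: int, H: int, eps: float) -> float:
    """Order of the sample bound H^4 * S * A * B / eps^2, constants and logs dropped."""
    return H**4 * S * A * B / eps**2


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to make the scaling visible.
    S, A, B, H, eps = 100, 5, 5, 20, 0.1
    print("Q-table entries:         ", q_table_entries(S, A, B, H))
    print("Transition model entries:", transition_model_entries(S, A, B, H))
    print("Sample bound (order):    ", sample_complexity_order(S, A, B, H, eps))
```

With these sizes the transition model is larger than the Q-table by a factor of $S$, which is the gap a model-free method avoids.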
Abstract
The thriving field of multi-agent reinforcement learning (MARL) studies how a group of interacting agents make decisions autonomously in a shared dynamic environment. Existing theoretical studies in this area suffer from at least two of the following obstacles: memory inefficiency, the heavy dependence of sample complexity on the long horizon and the large state space, the high computational complexity, non-Markov policy, non-Nash policy, and high burn-in cost. In this work, we take a step towards settling this problem by designing a model-free self-play algorithm, Memory-Efficient Nash Q-Learning (ME-Nash-QL), for two-player zero-sum Markov games, which is a specific setting of MARL. ME-Nash-QL is proven to enjoy the following merits. First, it can output an $\epsilon$-approximate Nash policy with space complexity and sample complexity…
Peer Reviews
Decision: ICLR 2024 poster
The proposed algorithm enjoys several benign properties, as mentioned in the summary. In particular, the algorithm performs well when the horizon is very long while retaining other nice properties such as a Markov output policy and low burn-in cost.
1. The proposed algorithm does not break the curse of multiagents. Although the authors argue that there are many scenarios where the horizon length is very long, I still feel that this is not general enough. I personally would still be more interested in algorithms that have $O(A+B)$ dependence in complexity. 2. The algorithmic novelty is a bit unclear to me.
+ The paper is well written and easy to follow. + The proposed algorithm outperforms existing algorithms in terms of space complexity and computational complexity.
- My main concern is the technical novelty. The reference-advantage decomposition technique has already been incorporated into two-player zero-sum Markov games by Feng et al. (2023) (not cited by this work), which achieves a regret of $\tilde{O}(\sqrt{H^2SABT})$ and matches the regret bound in this work. The main novelty of the algorithm design thus lies in the early-settlement design used to reduce the burn-in cost, which is not new in the literature. Feng, S., Yin, M., Wang, Y. X., Yang, J
# Originality
- The related works are covered in detail.
# Quality
- The theoretical proofs seem to be rigorous.
# Clarity
- This paper is in general well-written and easy to follow. The design idea of the algorithm is clearly explained.
# Significance
- The theoretical results of this work are strong. It achieves state-of-the-art space and computational complexity, nearly optimal sample complexity, and the best burn-in cost compared to previous results with the same sample complexity.
- TZMG i
- Although the proposed algorithm is compared to Nash-VI (Liu et al., July 2021) and V-learning (Jin et al., 2022) in detail, the design idea of the proposed algorithm seems to share certain similarities with those from the two works. For example, they all compute a CCE policy and take the marginal policies; the choice of learning rate $\frac{H+1}{H+N}$, the form of bonus terms, and the update of lower and upper bounds for Q-functions are similar. The originality of this paper could be significa
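The learning rate $\frac{H+1}{H+N}$ mentioned in this review can be illustrated in isolation. The sketch below is a generic tabular running-average update under that step size, with hypothetical function names and placeholder targets; it is not the ME-Nash-QL update itself, which according to the review also involves bonus terms and upper and lower bounds on the Q-functions.

```python
from typing import List, Sequence


def learning_rate(H: int, n: int) -> float:
    """Step size (H + 1) / (H + n) for the n-th visit (n >= 1) with horizon H."""
    return (H + 1) / (H + n)


def effective_weights(H: int, N: int) -> List[float]:
    """Weight that the i-th of N targets carries in the estimate after N updates.

    Unrolling q <- (1 - a_n) * q + a_n * target_n gives
    weight_i = a_i * prod_{j > i} (1 - a_j).
    """
    weights = []
    for i in range(1, N + 1):
        w = learning_rate(H, i)
        for j in range(i + 1, N + 1):
            w *= 1.0 - learning_rate(H, j)
        weights.append(w)
    return weights


def running_estimate(H: int, targets: Sequence[float], init: float = 0.0) -> float:
    """Apply the update once per target, in order, starting from init."""
    q = init
    for n, target in enumerate(targets, start=1):
        a = learning_rate(H, n)
        q = (1.0 - a) * q + a * target
    return q


if __name__ == "__main__":
    H = 10  # hypothetical horizon
    w = effective_weights(H, 50)
    print("first weight:", w[0], "last weight:", w[-1], "sum:", sum(w))
```

Because the rate equals 1 at $N = 1$, the initial value is overwritten by the first target, the weights sum to 1, and later targets carry more weight than earlier ones.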
Taxonomy
Topics: Reinforcement Learning in Robotics · Game Theory and Applications · Advanced Bandit Algorithms Research
