Provable Memory Efficient Self-Play Algorithm for Model-free Reinforcement Learning

Na Li; Yuchen Jiao; Hangguan Shan; Shefeng Yan

arXiv:2512.00351·cs.LG·December 2, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces ME-Nash-QL, a memory-efficient, model-free self-play algorithm for two-player zero-sum Markov games that improves space, sample, and burn-in costs while maintaining Markov policies.

Contribution

The paper proposes ME-Nash-QL, the first algorithm with provable memory efficiency and improved complexity bounds for model-free MARL in two-player zero-sum Markov games.

Findings

01

Achieves space complexity $O(SABH)$ for an $\varepsilon$-approximate Nash policy.

02

Reduces sample complexity to $\tilde{O}(H^4SAB/\varepsilon^2)$ for long horizons.

03

Lowers burn-in cost to $O(SAB\,\text{poly}(H))$ compared to previous methods.
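To get a feel for how the bounds in the findings scale, the short sketch below evaluates their leading-order terms for made-up problem sizes. It is only an illustration: constants and logarithmic factors are dropped, the function name `leading_terms` and all numeric values are hypothetical, and the burn-in term is left out because the degree of $\text{poly}(H)$ is not given here.

```python
def leading_terms(S, A, B, H, eps):
    """Leading-order terms of the quoted bounds (constants and log factors dropped).

    S: number of states, A and B: action counts of the two players,
    H: horizon length, eps: target accuracy of the Nash policy.
    """
    return {
        "space": S * A * B * H,                   # O(SABH)
        "samples": H**4 * S * A * B / eps**2,     # O~(H^4 SAB / eps^2)
    }


# Hypothetical problem sizes, chosen only for illustration.
coarse = leading_terms(S=10, A=4, B=4, H=20, eps=0.5)
fine = leading_terms(S=10, A=4, B=4, H=20, eps=0.25)

print(coarse["space"])                        # memory term, independent of eps
print(fine["samples"] / coarse["samples"])    # effect of halving eps
```

Halving the accuracy target quadruples the sample term while leaving the space term unchanged, and both terms grow with the product $AB$ rather than the sum $A+B$.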

Abstract

The thriving field of multi-agent reinforcement learning (MARL) studies how a group of interacting agents make decisions autonomously in a shared dynamic environment. Existing theoretical studies in this area suffer from at least two of the following obstacles: memory inefficiency, the heavy dependence of sample complexity on the long horizon and the large state space, the high computational complexity, non-Markov policy, non-Nash policy, and high burn-in cost. In this work, we take a step towards settling this problem by designing a model-free self-play algorithm, Memory-Efficient Nash Q-Learning (ME-Nash-QL), for two-player zero-sum Markov games, which is a specific setting of MARL. ME-Nash-QL is proven to enjoy the following merits. First, it can output an $\varepsilon$-approximate Nash policy with space complexity $O(SABH)$ and sample complexity…

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

The proposed algorithm enjoys several benign properties, as mentioned in the summary. In particular, the algorithm performs well when the horizon is very long while retaining other nice properties such as a Markov output policy and low burn-in cost.

Weaknesses

1. The proposed algorithm does not break the curse of multi-agent. Although the authors argue that there are many scenarios where the horizon length is very long, I still feel that this is not general enough. I personally would still be more interested in algorithms that have $O(A+B)$ dependence in complexity.
2. The algorithmic novelty is a bit unclear to me.

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

+ The paper is well written and easy to follow.
+ The proposed algorithm outperforms existing algorithms in terms of space complexity and computational complexity.

Weaknesses

- My main concern is the technical novelty. The reference-advantage decomposition technique has already been incorporated in two-player zero-sum Markov games by Feng et al. (2023) (not cited by this work), which achieves a regret of $\tilde{O}(\sqrt{H^2SABT})$ and matches the regret bound in this work. The main novelty of the algorithm design thus lies in the early-settlement design used to reduce the burn-in cost, which is not new in the literature.

Feng, S., Yin, M., Wang, Y. X., Yang, J

Reviewer 03 · Rating 8 (accept, good paper) · Confidence 3

Strengths

# Originality
- The related works are covered in detail.
# Quality
- The theoretical proofs seem to be rigorous.
# Clarity
- This paper is in general well-written and easy to follow. The design idea of the algorithm is clearly explained.
# Significance
- The theoretical results of this work are strong. It achieves state-of-the-art space and computational complexity, nearly optimal sample complexity, and the best burn-in cost compared to previous results with the same sample complexity.
- TZMG i

Weaknesses

- Although the proposed algorithm is compared to Nash-VI (Liu et al., July 2021) and V-learning (Jin et al., 2022) in detail, the design idea of the proposed algorithm seems to share certain similarities with those from the two works. For example, they all compute a CCE policy and take the marginal policies; the choice of learning rate $\frac{H+1}{H+N}$, the form of bonus terms, and the update of lower and upper bounds for Q-functions are similar. The originality of this paper could be significa
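The learning rate $\frac{H+1}{H+N}$ mentioned in this review can be illustrated with a small sketch. It assumes the standard incremental update $Q \leftarrow (1-\alpha_N)\,Q + \alpha_N \cdot \text{target}$, where $N$ is the visit count; that form and the helper names below are assumptions made for illustration, not code or notation taken from the paper.

```python
def learning_rate(H, n):
    """Learning rate (H + 1) / (H + n) for the n-th visit, n >= 1."""
    return (H + 1) / (H + n)


def update_weights(H, N):
    """Weight that the i-th target (i = 1..N) carries after N incremental updates.

    Unrolling Q <- (1 - a_n) * Q + a_n * target_n gives the i-th target the
    weight a_i * prod_{j > i} (1 - a_j).
    """
    weights = []
    for i in range(1, N + 1):
        w = learning_rate(H, i)
        for j in range(i + 1, N + 1):
            w *= 1 - learning_rate(H, j)
        weights.append(w)
    return weights


# Hypothetical horizon and visit count, chosen only for illustration.
w = update_weights(H=10, N=50)
print(sum(w))        # total weight over all 50 targets
print(w[0], w[-1])   # weight of the oldest vs. the newest target
```

Because the first learning rate equals 1, the initial value is fully overwritten and the weights sum to 1. Under this schedule recent targets carry far more weight than early ones, which is the usual motivation for it in episodic Q-learning analyses.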

Taxonomy

Topics: Reinforcement Learning in Robotics · Game Theory and Applications · Advanced Bandit Algorithms Research