Self-Play Q-learners Can Provably Collude in the Iterated Prisoner's Dilemma
Quentin Bertrand, Juan Duque, Emilio Calvano, Gauthier Gidel

TL;DR
This paper gives theoretical evidence that self-play Q-learning agents in the iterated prisoner's dilemma can learn to cooperate, offering an explanation for the collusive behavior observed in social dilemmas.
Contribution
The paper offers the first theoretical analysis of the conditions under which self-play Q-learners learn to cooperate rather than defect in social dilemmas.
Findings
Self-play Q-learners learn the cooperative Pavlov policy under broad conditions.
Additional experiments support the theoretical results and indicate that they are robust across a broader class of deep learning algorithms.
Agents tend to converge to cooperation rather than defection in the iterated prisoner's dilemma.
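To make the setting behind these findings concrete, here is a minimal sketch of a tabular Q-learning step for a memory-one agent, where the state is the previous joint action. The helper names (`init_q`, `q_update`, `greedy_policy`, `is_pavlov`), the learning rate, and the discount factor are illustrative assumptions; the paper's exact update scheme and exploration strategy are not given in this summary.

```python
from itertools import product

ACTIONS = ("C", "D")  # cooperate, defect
# Memory-one state: the previous (own action, opponent action) pair.
STATES = tuple(product(ACTIONS, repeat=2))


def init_q(value=0.0):
    """Create a Q-table with one entry per (state, action) pair."""
    return {(s, a): value for s in STATES for a in ACTIONS}


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one standard Q-learning update in place and return the table.

    alpha and gamma are assumed values, not taken from the paper.
    """
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q


def greedy_policy(q):
    """Map each state to the action with the highest Q-value (ties go to 'C')."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES}


def is_pavlov(policy):
    """Check for win-stay, lose-shift: cooperate iff both players matched last round."""
    return all(
        policy[(own, other)] == ("C" if own == other else "D")
        for own, other in STATES
    )
```

In self-play, one such table would choose the actions of both players, each seeing the previous joint action from its own side; `is_pavlov(greedy_policy(q))` then tests whether the learned greedy policy is the Pavlov policy.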
Abstract
A growing body of computational studies shows that simple machine learning agents converge to cooperative behaviors in social dilemmas, such as collusive price-setting in oligopoly markets, raising questions about what drives this outcome. In this work, we provide theoretical foundations for this phenomenon in the context of self-play multi-agent Q-learners in the iterated prisoner's dilemma. We characterize broad conditions under which such agents provably learn the cooperative Pavlov (win-stay, lose-shift) policy rather than the Pareto-dominated "always defect" policy. We validate our theoretical results through additional experiments, demonstrating their robustness across a broader class of deep learning algorithms.
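The two policies named in the abstract can be written out directly. The sketch below plays each policy against a copy of itself; the payoff values (5, 3, 1, 0) are a conventional prisoner's dilemma choice assumed here for illustration, not taken from the paper, and the function names are hypothetical.

```python
# Assumed payoffs (row player, column player): temptation 5 > reward 3 >
# punishment 1 > sucker 0. These numbers are illustrative only.
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def pavlov(prev_own, prev_other):
    """Win-stay, lose-shift: cooperate iff both players chose the same action."""
    return "C" if prev_own == prev_other else "D"


def always_defect(prev_own, prev_other):
    """Defect regardless of the previous round."""
    return "D"


def self_play(policy, start, rounds):
    """Play a policy against a copy of itself, starting from a joint action."""
    a, b = start
    history = [(a, b)]
    for _ in range(rounds):
        # Each player sees the previous round from its own perspective.
        a, b = policy(a, b), policy(b, a)
        history.append((a, b))
    return history


def average_payoff(history):
    """Average per-round payoff of the first player over a history."""
    return sum(PAYOFF[joint][0] for joint in history) / len(history)


print(self_play(pavlov, ("C", "D"), 3))
print(self_play(always_defect, ("C", "C"), 3))
```

Under these assumed payoffs, Pavlov in self-play returns to mutual cooperation within two rounds from any starting joint action, while "always defect" stays at mutual defection, which earns both players less per round.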
Taxonomy
Topics: Economic theories and models · Complex Systems and Time Series Analysis · Auction Theory and Applications
