Matching Multiple Experts: On the Exploitability of Multi-Agent Imitation Learning
Antoine Bergerault, Volkan Cevher, Negar Mehr

TL;DR
This paper investigates the limits of multi-agent imitation learning in multi-player games, demonstrating fundamental hardness results and proposing conditions under which low-exploitability policies can be learned effectively.
Contribution
It provides the first formal hardness results for offline multi-agent imitation learning and introduces strategic dominance assumptions to achieve low exploitability.
Findings
Impossibility results for low-exploitability policies in general n-player games.
A bound on Nash imitation gap under dominant strategy equilibria.
Generalization of results using best-response continuity and regularization.
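To make the notion of exploitability concrete, the sketch below computes a Nash gap for a two-player normal-form game. It is an illustration only, not the paper's construction or its bound. It assumes one common definition of the gap (the largest gain any single player can obtain by unilaterally deviating), and the function names and the Prisoner's Dilemma payoffs are hypothetical choices made for this example.

```python
from typing import List, Sequence

Matrix = List[List[float]]

# Hypothetical example game: Prisoner's Dilemma, actions 0 = cooperate, 1 = defect.
# Defecting is a dominant strategy for both players, so (defect, defect) is a
# dominant strategy equilibrium.
ROW_PAYOFF: Matrix = [[-1.0, -3.0], [0.0, -2.0]]
COL_PAYOFF: Matrix = [[-1.0, 0.0], [-3.0, -2.0]]


def expected_payoff(payoff: Matrix, p_row: Sequence[float], p_col: Sequence[float]) -> float:
    """Expected payoff when the two players mix independently."""
    return sum(
        p_row[i] * p_col[j] * payoff[i][j]
        for i in range(len(p_row))
        for j in range(len(p_col))
    )


def nash_gap(
    row_payoff: Matrix,
    col_payoff: Matrix,
    p_row: Sequence[float],
    p_col: Sequence[float],
) -> float:
    """Largest unilateral deviation gain over both players (assumed definition)."""
    v_row = expected_payoff(row_payoff, p_row, p_col)
    v_col = expected_payoff(col_payoff, p_row, p_col)
    # A best response can always be found among pure actions.
    best_row = max(
        sum(p_col[j] * row_payoff[i][j] for j in range(len(p_col)))
        for i in range(len(p_row))
    )
    best_col = max(
        sum(p_row[i] * col_payoff[i][j] for i in range(len(p_row)))
        for j in range(len(p_col))
    )
    return max(best_row - v_row, best_col - v_col)


if __name__ == "__main__":
    expert = [0.0, 1.0]    # always defect
    learned = [0.1, 0.9]   # imitator that cooperates 10% of the time
    print(nash_gap(ROW_PAYOFF, COL_PAYOFF, expert, expert))
    print(nash_gap(ROW_PAYOFF, COL_PAYOFF, learned, learned))
```

In this particular game, a learned policy that puts probability 0.1 on the dominated action has a gap of about 0.1, so the gap grows in proportion to the imitation error. That proportionality is specific to this toy example and should not be read as the paper's result.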
Abstract
Multi-agent imitation learning (MA-IL) aims to learn optimal policies from expert demonstrations of interactions in multi-agent interactive domains. Despite existing guarantees on the performance of the resulting learned policies, characterizations of how far the learned policies are from a Nash equilibrium are missing for offline MA-IL. In this paper, we demonstrate impossibility and hardness results of learning low-exploitable policies in general n-player Markov Games. We do so by providing examples where even exact measure matching fails, and demonstrating a new hardness result on characterizing the Nash gap given a fixed measure matching error. We then show how these challenges can be overcome using strategic dominance assumptions on the expert equilibrium. Specifically, for the case of dominant strategy expert equilibria, assuming Behavioral Cloning error ,…
Peer Reviews
Decision: ICLR 2026 Poster
- The paper presents a strong theoretical contribution, providing clear impossibility and hardness results that formalize the limits of multi-agent imitation learning (MA-IL).
- The proposed δ-continuity framework is a novel and interesting concept that bridges equilibrium stability and policy learning theory.
- The PPAD-hardness argument is rigorous and well-motivated.
- The proofs are detailed and supported by illustrative examples (though I did not check all the details).
- The work c
- My major concern is that, although the paper is theoretically strong, it lacks empirical validation to demonstrate how the theoretical insights (e.g., δ-continuity effects and dominance structures) manifest in practical game settings.
- The theory assumes access to known imitation errors (e.g., Equation 2), but it does not discuss how these quantities can be estimated or bounded in real applications.
- The assumptions of dominant strategy equilibria (DSE) or δ-continuity, while mathematica
* Excellent exposition of the concepts, related work, and background. It is difficult to do based on the sheer quantity of concepts that the work introduces and uses, which makes it all the more impressive and appreciated. Sec. 3 is well balanced with just enough depth to grasp the problem.
* The proof outline of Th. 3 is appreciated and very useful.
* The paper presents its results with an extremely high level of polish.
* The finding that deriving exploitability upper bounds ultimately depends on characterizing delta (and on the conditions that make delta consistent and tractable) is an interesting and valuable contribution. Still, the paper could better connect this result to more practical considerations. It would help to illustrate how specific techniques or settings might influence/shape delta in applied contexts. The brief mentions of “promoting exploration” and “penalizing risk aversion” (last lines of Sec
- Overall the paper is well-written, the question studied in this paper is clearly stated and most proofs are easy to follow.
- The paper provides constructive examples where BC error and measure matching error fail to capture the performance of learned equilibria.
- Although the paper is not mathematically heavy, I believe the result carries some significance to the game theory community.
- One issue with the definition of the tight Nash gap lower bound is that $\mathcal{M}_{\epsilon}(\pi^{E})$ is not a convex set due to the equality constraint. Given this non-convex nature, it is not surprising that computing this bound leads to some hardness results (maybe even NP-hardness). While it is understandable to impose such an equality constraint, the authors could consider a more natural way to capture this gap.
- To obtain a meaningful result from Lemma 4, one would expect $\delta(\epsil
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Game Theory and Applications
