ROTATE: Regret-driven Open-ended Training for Ad Hoc Teamwork
Caroline Wang, Arrasy Rahman, Jiaxun Cui, Yoonchang Sung, Peter Stone

TL;DR
ROTATE is a unified, regret-driven open-ended training framework for ad hoc teamwork. It iterates between improving the agent and challenging it with adversarially generated teammates, so that the agent generalizes better to unseen partners.
Contribution
The paper proposes ROTATE, an open-ended learning approach that combines teammate generation and agent training to improve generalization in ad hoc teamwork.
Findings
ROTATE outperforms baselines in diverse environments.
It achieves superior generalization to unseen teammates.
The framework establishes a new standard for robust teamwork.
Abstract
Learning to collaborate with previously unseen partners is a fundamental generalization challenge in multi-agent learning, known as Ad Hoc Teamwork (AHT). Existing AHT approaches often adopt a two-stage pipeline, where first, a fixed population of teammates is generated with the idea that they should be representative of the teammates that will be seen at deployment time, and second, an AHT agent is trained to collaborate well with agents in the population. To date, the research community has focused on designing separate algorithms for each stage. This separation has led to algorithms that generate teammates with limited coverage of possible behaviors, and that ignore whether the generated teammates are easy to learn from for the AHT agent. Furthermore, algorithms for training AHT agents typically treat the set of training teammates as static, thus attempting to generalize to…
Peer Reviews
Decision · Submitted to ICLR 2026
+ Interesting notion of regret.
+ Good comparison with related work.
- The authors could have done a better job of identifying which component is more important: the notion of regret, the way the teammates are generated, etc.
- Unclear novelty.
- Lack of clarity and theoretical discussion.
+ Comprehensive treatment of ZSC/ad hoc teamwork: the paper unifies teammate generation and ego learning via a cooperative-regret min–max objective $\min_{\pi^{ego}}\max_{\pi^{-i}}\mathbb{E}[\mathrm{CR}]$, making assumptions and evaluation protocol explicit.
- Clarity and exposition: the paper is difficult to follow; the core algorithmic loop (who updates when, how SP/XP/SXP are sampled/weighted, and how the BR is trained/used) is buried under notation, so the end-to-end procedure remains unclear even after multiple readings.
- Mischaracterization of the gap: (a) the claim that most ZSC/AHT methods are two-stage is outdated, since recent open-ended or end-to-end approaches already move beyond fixed teammate sets (e.g., COLE [1], E3T [2], TrajeDi [3]); (b)
+ The paper foregrounds the self-sabotage failure mode in open-ended partner generation, clearly articulating why partners that deliberately depress cross-play (XP) can inflate training signals yet harm zero-shot coordination.
+ This diagnosis sharpens evaluation design (e.g., beyond average XP) and motivates principled mitigation objectives.
- Eq. 10 employs a fixed 0.5/0.5 weighting with no analytical justification, and the experiments do not analyze this hyperparameter.
- The method section has poor readability, with unclear logic and difficult-to-follow exposition.
- Missing ZSC-side baselines, especially some open-ended methods like COLE [1] and E3T [2].
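The cooperative-regret min–max objective quoted in the reviews can be made concrete on a toy problem. The sketch below is not the paper's algorithm: it assumes a one-shot, fully cooperative matrix game, and it assumes that cooperative regret is the return of a best response to the teammate minus the return the ego agent obtains with that teammate (a definition this page does not spell out). The names `cooperative_regret` and `open_ended_loop`, the payoff matrix `U`, and the exponentiated-gradient ego update are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


def cooperative_regret(U, ego, mate):
    """Assumed definition: best-response return minus the ego's return.

    U[a, b] is the shared payoff when the ego plays a and the teammate plays b;
    `ego` and `mate` are probability vectors over rows and columns.
    """
    ego, mate = np.asarray(ego, float), np.asarray(mate, float)
    per_action = U @ mate  # expected payoff of each ego action against `mate`
    return float(per_action.max() - ego @ per_action)


def open_ended_loop(U, n_iters=50, lr=0.5):
    """Toy alternation between teammate generation and ego training."""
    n_ego, n_mate = U.shape
    logits = np.zeros(n_ego)  # ego policy parameters (uniform at start)
    population = []           # teammates generated so far
    candidates = np.eye(n_mate)
    for _ in range(n_iters):
        ego = softmax(logits)
        # Teammate generation: pick the teammate with maximal regret for the
        # current ego. In this one-shot game the regret is convex in the
        # teammate's mixed strategy, so checking pure strategies is enough.
        regrets = [cooperative_regret(U, ego, m) for m in candidates]
        population.append(candidates[int(np.argmax(regrets))])
        # Ego training: the best-response term does not depend on the ego, so
        # lowering mean regret over the population means raising mean return.
        mean_mate = np.mean(population, axis=0)
        logits = logits + lr * (U @ mean_mate)
    return softmax(logits), population


U = np.eye(2)  # coordination game: payoff 1 only when both actions match
ego, population = open_ended_loop(U)
worst = max(cooperative_regret(U, ego, m) for m in np.eye(2))
print("ego policy:", ego, "worst-case regret:", worst)
```

In this coordination game no ego policy can push its worst-case regret below 0.5, so the printed value is only a sanity check on the toy loop, not evidence about ROTATE itself.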
Taxonomy
Topics: Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI) · Advanced Bandit Algorithms Research
Methods: ADaptive gradient method with the OPTimal convergence rate · High-Order Consensuses · Sparse Evolutionary Training
