Agent Island: A Saturation- and Contamination-Resistant Benchmark from Multiagent Games
Connacher Murphy

TL;DR
Agent Island is a multiplayer benchmark environment that evaluates language-model agents in competitive multiagent games, addressing issues of saturation and contamination in capability assessment.
Contribution
We introduce a dynamic, winner-take-all multiplayer environment in which agents compete against other adaptive agents, and we rank players with a Bayesian Plackett-Luce model that quantifies uncertainty in player skill.
Findings
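Across 999 games involving 49 unique models, openai/gpt-5.5 ranks first with a posterior mean skill of 5.64, compared with 3.10 for second-ranked openai/gpt-5.2.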
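The environment is designed to mitigate saturation and contamination: new models can always outperform the current leader, and agents face other adaptive agents rather than a fixed task set.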
Models show provider-based voting preferences, strongest for OpenAI.
Abstract
Static capabilities benchmarks suffer from saturation and contamination, making it difficult to track capabilities progress over time. We introduce Agent Island, a multiplayer simulation environment in which language-model agents compete in a game of interagent cooperation, conflict, and persuasion. The environment yields a dynamic benchmark designed to mitigate both saturation and contamination; new models can always outperform the current leading player in this winner-take-all game, and agents compete against other adaptive agents rather than face a fixed task set. We rank players with a Bayesian Plackett-Luce model, allowing us to quantify uncertainty in player skill. In 999 games involving 49 unique models, openai/gpt-5.5 dominates its peers with a posterior mean skill of 5.64, compared with 3.10 for the second-ranked model, openai/gpt-5.2, and 2.86 for the third-ranked model,…
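To make the ranking model concrete, here is a minimal sketch of the Plackett-Luce likelihood that underlies this kind of skill ranking. It is an illustration under stated assumptions, not the paper's implementation: it assumes each player has a real-valued skill on a log-strength scale, it covers only the likelihood (no prior or posterior inference, which the Bayesian version would add), and the function names and player labels are hypothetical.

```python
import math


def plackett_luce_log_likelihood(ranking, skill):
    """Log-probability of a full finishing order under Plackett-Luce.

    ranking: list of player names, best first.
    skill: dict mapping player name -> log-strength (assumed scale).
    """
    thetas = [skill[p] for p in ranking]
    log_lik = 0.0
    # At each stage, the next-placed player is chosen from those remaining
    # with probability proportional to exp(skill).
    for i in range(len(thetas)):
        remaining = thetas[i:]
        m = max(remaining)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(t - m) for t in remaining))
        log_lik += thetas[i] - log_denom
    return log_lik


def win_probability(player, skill):
    """Probability that `player` finishes first (the top-1 case of the model)."""
    m = max(skill.values())
    weights = {p: math.exp(t - m) for p, t in skill.items()}
    return weights[player] / sum(weights.values())


# Hypothetical players with equal skill: every ordering is equally likely.
skill = {"a": 0.0, "b": 0.0, "c": 0.0}
equal_ll = plackett_luce_log_likelihood(["a", "b", "c"], skill)
equal_win = win_probability("a", skill)
```

With three equally skilled players, each of the six orderings has probability 1/6 and each player wins with probability 1/3. In a winner-take-all setting only the top-1 term is observed per game, which is the quantity `win_probability` computes.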
