Agent Island: A Saturation- and Contamination-Resistant Benchmark from Multiagent Games

Connacher Murphy

arXiv:2605.04312·cs.AI·May 7, 2026

Agent Island: A Saturation- and Contamination-Resistant Benchmark from Multiagent Games

Connacher Murphy

PDF

TL;DR

Agent Island is a multiplayer benchmark environment that evaluates language-model agents in competitive multiagent games, addressing issues of saturation and contamination in capability assessment.

Contribution

We introduce a dynamic, winner-take-all multiplayer environment with adaptive agents and Bayesian skill ranking, providing a more robust benchmark for model capabilities.

Findings

01

OpenAI GPT-5.5 outperforms peers with a skill score of 5.64.

02

The environment mitigates saturation and contamination issues in benchmarks.

03

Models show provider-based voting preferences, strongest for OpenAI.

Abstract

Static capabilities benchmarks suffer from saturation and contamination, making it difficult to track capabilities progress over time. We introduce Agent Island, a multiplayer simulation environment in which language-model agents compete in a game of interagent cooperation, conflict, and persuasion. The environment yields a dynamic benchmark designed to mitigate both saturation and contamination; new models can always outperform the current leading player in this winner-take-all game, and agents compete against other adaptive agents rather than face a fixed task set. We rank players with a Bayesian Plackett-Luce model, allowing us to quantify uncertainty in player skill. In 999 games involving 49 unique models, openai/gpt-5.5 dominates its peers with a posterior mean skill of 5.64, compared with 3.10 for the second-ranked model, openai/gpt-5.2, and 2.86 for the third-ranked model,…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.