CREW-WILDFIRE: Benchmarking Agentic Multi-Agent Collaborations at Scale
Jonathan Hyun, Nicholas R Waytowich, Boyuan Chen

TL;DR
CREW-Wildfire is a comprehensive benchmark designed to evaluate large-scale, multi-agent AI systems in complex wildfire response scenarios, addressing limitations of existing small-scale, low-complexity environments.
Contribution
CREW-Wildfire introduces a realistic, scalable wildfire response environment with diverse agents and tasks, enabling assessment of advanced multi-agent coordination and planning capabilities.
Findings
State-of-the-art LLM-based frameworks show significant performance gaps.
The results highlight challenges in large-scale coordination and long-horizon planning.
The benchmark provides a foundation for future research in scalable multi-agent AI.
Abstract
Despite rapid progress in large language model (LLM)-based multi-agent systems, current benchmarks fall short in evaluating their scalability, robustness, and coordination capabilities in complex, dynamic, real-world tasks. Existing environments typically focus on small-scale, fully observable, or low-complexity domains, limiting their utility for developing and assessing next-generation multi-agent Agentic AI frameworks. We introduce CREW-Wildfire, an open-source benchmark designed to close this gap. Built atop the human-AI teaming CREW simulation platform, CREW-Wildfire offers procedurally generated wildfire response scenarios featuring large maps, heterogeneous agents, partial observability, stochastic dynamics, and long-horizon planning objectives. The environment supports both low-level control and high-level natural language interactions through modular Perception and Execution…
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Multimodal Machine Learning Applications · Topic Modeling
