Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments
Romain Froger, Pierre Andrews, Matteo Bettini, Amar Budhiraja, Ricardo Silveira Cabral, Virginie Do, Emilien Garreau, Jean-Baptiste Gaya, Hugo Laurençon, Maxime Lecanu, Kunal Malkan, Dheeraj Mekala, Pierre Ménard, Gerard Moreno-Torres Bertran, Ulyana Piterbarg

TL;DR
Gaia2 is a new benchmark for evaluating large language model agents in realistic, asynchronous, and dynamic environments, emphasizing temporal constraints, noise, ambiguity, and collaboration, with detailed action-level evaluation.
Contribution
We introduce Gaia2, a comprehensive benchmark and evaluation framework for LLM agents in complex, asynchronous environments, enabling fine-grained assessment and fostering development of practical agent systems.
Findings
GPT-5 achieves 42% pass@1 but struggles with time-sensitive tasks.
Claude-4 Sonnet balances accuracy and speed but at higher cost.
Kimi-K2 leads among open-source models with 21% pass@1.
Abstract
We introduce Gaia2, a benchmark for evaluating large language model agents in realistic, asynchronous environments. Unlike prior static or synchronous evaluations, Gaia2 introduces scenarios where environments evolve independently of agent actions, requiring agents to operate under temporal constraints, adapt to noisy and dynamic events, resolve ambiguity, and collaborate with other agents. Each scenario is paired with a write-action verifier, enabling fine-grained, action-level evaluation and making Gaia2 directly usable for reinforcement learning from verifiable rewards. Our evaluation of state-of-the-art proprietary and open-source models shows that no model dominates across capabilities: GPT-5 (high) reaches the strongest overall score of 42% pass@1 but fails on time-sensitive tasks, Claude-4 Sonnet trades accuracy and speed for cost, Kimi-K2 leads among open-source models with 21%…
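To make the idea of action-level evaluation and the pass@1 figures concrete, here is a minimal sketch in Python. It is not the Gaia2 verifier: the names `WriteAction`, `make_action`, `verify`, and `pass_at_1` are hypothetical, and the matching rule (exact multiset equality of write actions, ignoring order) is an assumption chosen for illustration only.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteAction:
    """Hypothetical record of one state-changing tool call made by an agent."""
    tool: str
    args: tuple  # sorted (key, value) pairs, so the action is hashable


def make_action(tool, **kwargs):
    """Build a WriteAction with arguments in a canonical (sorted) order."""
    return WriteAction(tool, tuple(sorted(kwargs.items())))


def verify(agent_actions, expected_actions):
    """Assumed rule: pass only if the agent performed exactly the expected
    write actions, compared as multisets (order is ignored)."""
    return Counter(agent_actions) == Counter(expected_actions)


def pass_at_1(results):
    """Fraction of scenarios passed when each scenario is attempted once."""
    return sum(results) / len(results)


expected = [make_action("send_email", to="a@example.com", subject="Hi")]
good_run = [make_action("send_email", subject="Hi", to="a@example.com")]
bad_run = []  # the agent never sent the email

results = [verify(good_run, expected), verify(bad_run, expected)]
score = pass_at_1(results)
```

Because each scenario yields a binary pass or fail from its verifier, the same signal could in principle serve as a reward, which is the sense in which the abstract describes the benchmark as usable for reinforcement learning from verifiable rewards.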
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education
