Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments

Romain Froger; Pierre Andrews; Matteo Bettini; Amar Budhiraja; Ricardo Silveira Cabral; Virginie Do; Emilien Garreau; Jean-Baptiste Gaya; Hugo Laurençon; Maxime Lecanu; Kunal Malkan; Dheeraj Mekala; Pierre Ménard; Gerard Moreno-Torres Bertran; Ulyana Piterbarg; Mikhail Plekhanov; Mathieu Rita; Andrey Rusakov; Vladislav Vorotilov; Mengjue Wang; Ian Yu; Amine Benhalloum; Grégoire Mialon; Thomas Scialom

arXiv:2602.11964 · cs.AI · February 13, 2026

TL;DR

Gaia2 is a new benchmark for evaluating large language model agents in realistic, asynchronous, and dynamic environments, emphasizing temporal constraints, noise, ambiguity, and collaboration, with detailed action-level evaluation.

Contribution

We introduce Gaia2, a comprehensive benchmark and evaluation framework for LLM agents in complex, asynchronous environments, enabling fine-grained assessment and fostering development of practical agent systems.

Findings

01. GPT-5 (high) reaches the strongest overall score, 42% pass@1, but fails on time-sensitive tasks.

02. Claude-4 Sonnet trades accuracy and speed for cost.

03. Kimi-K2 leads among open-source models with 21% pass@1.

Abstract

We introduce Gaia2, a benchmark for evaluating large language model agents in realistic, asynchronous environments. Unlike prior static or synchronous evaluations, Gaia2 introduces scenarios where environments evolve independently of agent actions, requiring agents to operate under temporal constraints, adapt to noisy and dynamic events, resolve ambiguity, and collaborate with other agents. Each scenario is paired with a write-action verifier, enabling fine-grained, action-level evaluation and making Gaia2 directly usable for reinforcement learning from verifiable rewards. Our evaluation of state-of-the-art proprietary and open-source models shows that no model dominates across capabilities: GPT-5 (high) reaches the strongest overall score of 42% pass@1 but fails on time-sensitive tasks, Claude-4 Sonnet trades accuracy and speed for cost, Kimi-K2 leads among open-source models with 21%…
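To make the idea of action-level verification and the pass@1 metric concrete, here is a minimal sketch in Python. It is not the paper's verifier, whose rules are not described in this summary. The names `WriteAction`, `action`, `verify`, `reward`, and `pass_at_1` are hypothetical, and the sketch assumes that a scenario passes when the agent's state-changing (write) tool calls match an expected set exactly, ignoring order. It also assumes one attempt per scenario, so pass@1 is simply the fraction of scenarios that pass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteAction:
    """Hypothetical record of one state-changing tool call made by an agent."""
    tool: str
    args: tuple  # sorted (key, value) pairs, so equal calls compare equal


def action(tool, **kwargs):
    """Build a WriteAction with its arguments in a canonical order."""
    return WriteAction(tool, tuple(sorted(kwargs.items())))


def verify(agent_actions, expected_actions):
    """Toy verifier (assumption): pass only on an exact match, ignoring order."""
    return sorted(agent_actions, key=repr) == sorted(expected_actions, key=repr)


def reward(agent_actions, expected_actions):
    """Binary verifiable reward derived from the verifier outcome."""
    return 1.0 if verify(agent_actions, expected_actions) else 0.0


def pass_at_1(outcomes):
    """Fraction of scenarios solved, given one attempt per scenario."""
    return sum(outcomes) / len(outcomes)


# Illustrative data only; these tools and arguments are made up.
expected = [
    action("send_email", to="alice@example.com"),
    action("add_event", title="Sync"),
]
run_ok = list(reversed(expected))  # same write actions, different order
run_bad = expected[:1]             # one write action is missing

outcomes = [verify(run_ok, expected), verify(run_bad, expected)]
score = pass_at_1(outcomes)
```

Under these assumptions the first run passes and the second fails, so `score` is 0.5. A real verifier would likely need softer matching (for example on free-text arguments) and timing checks for time-sensitive scenarios, which this sketch leaves out.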

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education