FutureSim: Replaying World Events to Evaluate Adaptive Agents

Shashwat Goel; Nikhil Chandak; Arvindh Arun; Ameya Prabhu; Steffen Staab; Moritz Hardt; Maksym Andriushchenko; Jonas Geiping

arXiv:2605.15188·cs.LG·May 15, 2026



TL;DR

FutureSim is a simulation framework that replays real-world events to evaluate AI agents' ability to predict and adapt to unfolding world developments over extended periods.

Contribution

The paper introduces FutureSim, a novel benchmark for assessing AI agents' long-term prediction and adaptation in realistic, dynamic environments using chronological event replay.
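As a rough illustration of chronological event replay, the sketch below feeds time-stamped events to a callback in date order, so the callback only ever sees events that have already occurred in the simulated timeline. The `Event` class, the `replay` function, and the event kinds are hypothetical names chosen for this example; they are not taken from the FutureSim codebase.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    """A hypothetical replay event: a news article or a question resolution."""
    day: date
    kind: str      # assumed labels: "article" or "resolution"
    payload: str


def replay(events, on_event):
    """Deliver events in chronological order.

    on_event receives the current event and a tuple of all earlier events,
    so nothing from the simulated future is visible when it is called.
    """
    seen = []
    for event in sorted(events, key=lambda e: e.day):
        on_event(event, tuple(seen))
        seen.append(event)
    return seen


# Usage: events supplied out of order are replayed by date.
log = []
events = [
    Event(date(2026, 3, 1), "resolution", "question 1 resolved"),
    Event(date(2026, 1, 5), "article", "first article"),
    Event(date(2026, 2, 10), "article", "second article"),
]
replay(events, lambda event, history: log.append((event.day, len(history))))
```

A real harness would also need to gate search and memory tools on the simulated date; this sketch covers only the ordering constraint.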

Findings

01

The best agent reached 25% accuracy when forecasting world events over the January to March 2026 period.

02

Many agents had a worse Brier skill score than making no prediction at all.

03

FutureSim enables studying long-horizon adaptation, search, memory, and uncertainty reasoning.

Abstract

AI agents are being increasingly deployed in dynamic, open-ended environments that require adapting to new information as it arrives. To efficiently measure this capability for realistic use-cases, we propose building grounded simulations that replay real-world events in the order they occurred. We build FutureSim, where agents forecast world events beyond their knowledge cutoff while interacting with a chronological replay of the world: real news articles arriving and questions resolving over the simulated period. We evaluate frontier agents in their native harness, testing their ability to predict world events over a three-month period from January to March 2026. FutureSim reveals a clear separation in their capabilities, with the best agent's accuracy being 25%, and many having worse Brier skill score than making no prediction at all. Through careful ablations, we show how FutureSim…
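The abstract compares agents by Brier skill score without defining it here. The sketch below uses the standard textbook form for binary outcomes, BSS = 1 - BS / BS_ref, and assumes an uninformative constant forecast of 0.5 as the reference. Both the binary setting and the 0.5 reference are assumptions made for illustration; the paper's question format and its "no prediction" baseline may differ.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(probs, outcomes, reference_prob=0.5):
    """Skill relative to a constant reference forecast (assumed 0.5 here).

    Positive values beat the reference, zero matches it, and negative
    values are worse than the reference.
    """
    bs = brier_score(probs, outcomes)
    bs_ref = brier_score([reference_prob] * len(outcomes), outcomes)
    if bs_ref == 0:
        raise ValueError("reference forecast is perfect; skill is undefined")
    return 1.0 - bs / bs_ref


outcomes = [1, 0, 1, 0]
calibrated = brier_skill_score([0.9, 0.1, 0.8, 0.2], outcomes)
overconfident_wrong = brier_skill_score([0.1, 0.9, 0.2, 0.8], outcomes)
```

Under these assumptions, the confidently wrong forecaster ends up with a negative score, which is the sense in which an agent can do worse than an uninformative baseline.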


Code & Models

Repositories

openforecaster/futuresim
github

Datasets

aoiandroid/papers
dataset · 28 dl
