The World Won't Stay Still: Programmable Evolution for Agent Benchmarks

Guangrui Li; Yaochen Xie; Yi Liu; Ziwei Dong; Xingyuan Pan; Tianqi Zheng; Jason Choi; Michael J. Morais; Binit Jha; Shaunak Mishra; Bingrou Zhou; Chen Luo; Monica Xiao Cheng; Dawn Song

arXiv:2603.05910·cs.AI·May 20, 2026

TL;DR

This paper introduces ProEvolve, a graph-based framework for creating evolving environment benchmarks to evaluate the adaptability of tool-calling agents in dynamic settings.

Contribution

ProEvolve provides a programmable, graph-based method for modeling and generating evolving environments, enabling better assessment of agent robustness over time.

Findings

01 ProEvolve successfully generates dynamic environments in e-commerce and airline booking domains.

02 The framework allows explicit modeling of environment changes through graph transformations.

03 Agents' performance varies with environment evolution, highlighting the importance of adaptability.

Abstract

LLM-powered tool-calling agents fulfill user requests by interacting with environments, querying data, and invoking tools in a multi-turn process. Yet most existing benchmarks evaluate these systems under static environment interfaces, with fixed schemas and toolsets, making it difficult to assess how agents behave as environments evolve, that is, when capabilities are added, reorganized, or deprecated across successive environment versions. In this paper, we study structured environment evolution as a benchmark-construction problem for tool-calling agents. We propose ProEvolve, a graph-based framework that makes environment evolution programmable. At its core, a typed relational graph provides a unified, explicit representation of the environment: data, tools, and schema. Under this formalism, adding, removing, or modifying capabilities are expressed as graph transformations that…
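To make the idea of a typed graph with versioned transformations concrete, here is a minimal sketch. It is not ProEvolve's implementation or API: the class `EnvGraph`, the node kinds (`"table"`, `"tool"`), the relation name `"reads"`, and the helpers `add_tool` and `deprecate_tool` are all hypothetical names chosen for illustration. The sketch only assumes what the abstract states, namely that data, tools, and schema live in one typed graph and that adding or removing a capability is a transformation producing a new environment version.

```python
import copy
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    name: str
    kind: str  # hypothetical node types, e.g. "table" or "tool"


@dataclass
class EnvGraph:
    """Toy typed relational graph: typed nodes plus labeled edges."""
    nodes: dict = field(default_factory=dict)  # name -> Node
    edges: set = field(default_factory=set)    # (src, relation, dst)

    def add_node(self, name, kind):
        self.nodes[name] = Node(name, kind)

    def add_edge(self, src, relation, dst):
        # Reject edges that point at nodes the graph does not contain.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError(f"unknown endpoint in edge {src} -> {dst}")
        self.edges.add((src, relation, dst))

    def tools(self):
        return sorted(n.name for n in self.nodes.values() if n.kind == "tool")


def add_tool(graph, tool, reads):
    """Transformation: return a new version with `tool` added."""
    new = copy.deepcopy(graph)  # earlier versions stay untouched
    new.add_node(tool, "tool")
    for table in reads:
        new.add_edge(tool, "reads", table)
    return new


def deprecate_tool(graph, tool):
    """Transformation: return a new version without `tool` or its edges."""
    new = copy.deepcopy(graph)
    del new.nodes[tool]
    new.edges = {e for e in new.edges if tool not in (e[0], e[2])}
    return new


# Version 1: one data table and one tool that reads it.
v1 = EnvGraph()
v1.add_node("orders", "table")
v1.add_node("get_order", "tool")
v1.add_edge("get_order", "reads", "orders")

# Successive versions are derived by applying transformations.
v2 = add_tool(v1, "cancel_order", reads=["orders"])
v3 = deprecate_tool(v2, "get_order")
```

Each transformation returns a copy rather than mutating its input, so every environment version remains available for evaluating an agent against it. This is a design choice of the sketch, not something the abstract specifies.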

Taxonomy

Topics: Model-Driven Software Engineering Techniques · Multimodal Machine Learning Applications · Artificial Intelligence in Games