ClawArena: Benchmarking AI Agents in Evolving Information Environments

Haonian Ji; Kaiwen Xiong; Siwei Han; Peng Xia; Shi Qiu; Yiyang Zhou; Jiaqi Liu; Jinlong Li; Bingzhou Li; Zeyu Zheng; Cihang Xie; Huaxiu Yao

arXiv:2604.04202 · cs.LG · May 19, 2026

TL;DR

ClawArena is a comprehensive benchmark designed to evaluate AI agents' ability to maintain accurate beliefs in complex, evolving information environments with conflicting and partial data sources.

Contribution

The paper introduces a benchmark built around three challenges (multi-source conflict reasoning, dynamic belief revision, and implicit personalization), together with diverse scenarios and evaluation formats.

Findings

01. Model capability explains a 29-point variation in scores.

02. Framework design accounts for up to a 24-point variation in scores.

03. The difficulty of belief revision depends on the update strategy, not on volume.

Abstract

AI agents deployed as persistent assistants must maintain correct beliefs as their information environment evolves. In practice, evidence is scattered across heterogeneous sources that often contradict one another, new information can invalidate earlier conclusions, and user preferences surface through corrections rather than explicit instructions. Existing benchmarks largely assume static, single-authority settings and do not evaluate whether agents can keep up with this complexity. We introduce ClawArena, a benchmark for evaluating AI agents in evolving information environments. Each scenario maintains a complete hidden ground truth while exposing the agent only to noisy, partial, and sometimes contradictory traces across multi-channel sessions, workspace files, and staged updates. Evaluation is organized around three coupled challenges: multi-source conflict reasoning, dynamic belief…
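
The setup described in the abstract (a hidden ground truth, partial and contradictory traces, staged updates) can be pictured with a small toy model. The sketch below is illustrative only and is not the ClawArena data format or scoring code: the names `Trace`, `Scenario`, `latest_wins_belief`, and `accuracy`, the example facts, and the "latest stage wins" revision rule are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trace:
    """One piece of evidence the agent can see (hypothetical structure)."""
    source: str  # e.g. "chat" or "workspace_file" (illustrative labels)
    stage: int   # staged update at which the trace becomes visible
    key: str     # the fact the trace talks about
    value: str   # what the trace claims; may contradict the ground truth


@dataclass
class Scenario:
    """Toy scenario: the ground truth is kept hidden from the agent."""
    ground_truth: Dict[str, str]
    traces: List[Trace]

    def visible(self, stage: int) -> List[Trace]:
        # The agent only sees traces released up to the current stage.
        return [t for t in self.traces if t.stage <= stage]


def latest_wins_belief(traces: List[Trace]) -> Dict[str, str]:
    """Naive revision rule (an assumption): later stages override earlier ones."""
    belief: Dict[str, str] = {}
    for t in sorted(traces, key=lambda t: t.stage):
        belief[t.key] = t.value
    return belief


def accuracy(belief: Dict[str, str], ground_truth: Dict[str, str]) -> float:
    """Fraction of ground-truth facts the belief state gets right."""
    correct = sum(belief.get(k) == v for k, v in ground_truth.items())
    return correct / len(ground_truth)


scenario = Scenario(
    ground_truth={"deadline": "Friday", "owner": "Dana"},
    traces=[
        Trace("chat", 0, "deadline", "Thursday"),      # outdated claim
        Trace("workspace_file", 0, "owner", "Dana"),   # correct
        Trace("email", 1, "deadline", "Friday"),       # later correction
    ],
)

for stage in (0, 1):
    belief = latest_wins_belief(scenario.visible(stage))
    print(stage, belief, accuracy(belief, scenario.ground_truth))
```

In this toy run the belief state is half right before the staged update and fully right after it. A "latest wins" rule is deliberately simplistic: it would fail whenever a later trace is the noisy one, which is the kind of case the conflict-reasoning and belief-revision challenges are described as probing.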

Code & Models

Repositories

aiming-lab/ClawArena
github

Datasets

Haonian/ClawArena
dataset · 1.5k downloads
