ZEBRAARENA: A Diagnostic Simulation Environment for Studying Reasoning-Action Coupling in Tool-Augmented LLMs

Wanjia Zhao; Ludwig Schmidt; James Zou; Vidhisha Balachandran; Lingjiao Chen

arXiv:2603.18614·cs.AI·March 20, 2026

Open Access

TL;DR

ZebraArena is a diagnostic environment designed to evaluate reasoning and tool use in large language models, minimizing confounding factors and enabling precise measurement of reasoning-action coupling.

Contribution

The paper introduces a controllable, knowledge-minimal benchmark for studying how LLMs combine reasoning with external tool interaction. Evaluation is deterministic, and each task has a theoretically optimal query count against which tool-use efficiency can be measured.

Findings

01

Frontier models such as GPT-5 achieve 60% accuracy on hard tasks.

02

Models often use significantly more tool calls than the theoretical optimum.

03

ZebraArena reveals persistent gaps between optimal and actual tool use.

Abstract

Tool-augmented large language models (LLMs) must tightly couple multi-step reasoning with external actions, yet existing benchmarks often confound this interplay with complex environment dynamics, memorized knowledge or dataset contamination. In this paper, we introduce ZebraArena, a procedurally generated diagnostic environment for studying reasoning-action coupling in tool-augmented LLMs, with controllable difficulty and a knowledge-minimal design, which limits gains from memorization or dataset contamination. Each task in ZebraArena requires a set of critical information which is available only through targeted tool use, yielding an interpretable interface between external information acquisition and deductive reasoning. This design provides deterministic evaluation via unique solutions, and a theoretical optimal query count for measuring efficient tool use. We show that ZebraArena…
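The abstract mentions a theoretical optimal query count as a reference point for efficient tool use. The excerpt here does not give the paper's exact metric definition, so the following is only a minimal sketch of one plausible way to score it: relative overhead of actual tool calls over the optimum, reported alongside accuracy. The names `query_overhead` and `summarize`, and the episode tuple layout, are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical episode record: (solved, actual_tool_calls, optimal_tool_calls).
Episode = Tuple[bool, int, int]


def query_overhead(actual_calls: int, optimal_calls: int) -> float:
    """Extra tool calls as a fraction of the optimum (0.0 means optimal).

    Assumed definition for illustration; the paper may define efficiency differently.
    """
    if optimal_calls <= 0:
        raise ValueError("optimal_calls must be positive")
    if actual_calls < 0:
        raise ValueError("actual_calls must be non-negative")
    return (actual_calls - optimal_calls) / optimal_calls


def summarize(episodes: List[Episode]) -> Tuple[float, float]:
    """Return (accuracy, mean query overhead) over a non-empty list of episodes."""
    if not episodes:
        raise ValueError("episodes must be non-empty")
    accuracy = sum(1 for solved, _, _ in episodes if solved) / len(episodes)
    mean_overhead = sum(query_overhead(a, o) for _, a, o in episodes) / len(episodes)
    return accuracy, mean_overhead


if __name__ == "__main__":
    # Made-up numbers purely to show the call shape, not results from the paper.
    demo = [(True, 4, 4), (False, 6, 4)]
    print(summarize(demo))
```

Reporting overhead separately from accuracy keeps the two failure modes apart: a model can solve a task while issuing many redundant queries, or query efficiently and still reason its way to a wrong answer.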

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)