ZEBRAARENA: A Diagnostic Simulation Environment for Studying Reasoning-Action Coupling in Tool-Augmented LLMs
Wanjia Zhao, Ludwig Schmidt, James Zou, Vidhisha Balachandran, Lingjiao Chen

TL;DR
ZebraArena is a diagnostic environment designed to evaluate reasoning and tool use in large language models, minimizing confounding factors and enabling precise measurement of reasoning-action coupling.
Contribution
It introduces a controllable, knowledge-minimal benchmark for studying reasoning and external tool interaction in LLMs, with deterministic evaluation and theoretical optimal query metrics.
Findings
Frontier models like GPT-5 achieve 60% accuracy on hard tasks.
Models often use significantly more tool calls than the theoretical optimum.
ZebraArena reveals persistent gaps between optimal and actual tool use.
Abstract
Tool-augmented large language models (LLMs) must tightly couple multi-step reasoning with external actions, yet existing benchmarks often confound this interplay with complex environment dynamics, memorized knowledge or dataset contamination. In this paper, we introduce ZebraArena, a procedurally generated diagnostic environment for studying reasoning-action coupling in tool-augmented LLMs, with controllable difficulty and a knowledge-minimal design, which limits gains from memorization or dataset contamination. Each task in ZebraArena requires a set of critical information which is available only through targeted tool use, yielding an interpretable interface between external information acquisition and deductive reasoning. This design provides deterministic evaluation via unique solutions, and a theoretical optimal query count for measuring efficient tool use. We show that ZebraArena…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)
