ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
Yuxiang Lai, Peng Xia, Haonian Ji, Kaiwen Xiong, Kaide Zeng, Jiaqi Liu, Fang Wu, Jike Zhong, Zeyu Zheng, Cihang Xie, Huaxiu Yao

TL;DR
ClawForge introduces a benchmark framework for evaluating command-line agents in realistic, state-conflicted workflows, revealing diverse failure modes and limited accuracy across models.
Contribution
The paper presents ClawForge, a generator-backed, executable benchmark framework for testing agents in state-conflicted command-line workflows, enabling detailed failure analysis.
Findings
The best model achieves only 45.3% strict accuracy.
Wrong-state replacement remains below 17% for all models.
Model performance varies significantly based on state inspection behavior.
Abstract
Interactive agent benchmarks face a tension between scalable construction and realistic workflow evaluation. Hand-authored tasks are expensive to extend and revise, while static prompt evaluation misses failures that only appear when agents operate over persistent state. Existing interactive benchmarks have advanced agent evaluation significantly, but most initialize tasks from clean state and do not systematically test how agents handle pre-existing partial, stale, or conflicting artifacts. We present ClawForge, a generator-backed benchmark framework for executable command-line workflows under state conflict. The framework compiles scenario templates, grounded slots, initialized state, reference trajectories, and validators into reproducible task specifications, and evaluates agents step by step over persistent workflow surfaces using normalized end state and observable side…
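To make the compilation step in the abstract concrete, here is a minimal sketch of how a scenario template, grounded slots, a conflicting initial state, and an end-state validator might be combined into one task specification. It is an illustration under stated assumptions, not ClawForge's actual schema: `TaskSpec`, `compile_task`, `normalize`, and the path-to-contents state model are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical state model: file path -> file contents.
State = Dict[str, str]


def normalize(state: State) -> State:
    """Normalize paths and whitespace so cosmetic differences do not fail a task."""
    return {path.strip("/"): text.strip() for path, text in state.items()}


@dataclass
class TaskSpec:
    """Hypothetical compiled task: an instruction, a starting state, and a validator."""
    instruction: str
    initial_state: State
    validator: Callable[[State], bool]


def compile_task(template: str, slots: Dict[str, str],
                 stale_state: State, expected_state: State) -> TaskSpec:
    """Fill the template with grounded slots and attach an end-state validator."""
    instruction = template.format(**slots)

    def validator(end_state: State) -> bool:
        # Pass only if the normalized end state matches the expected state.
        return normalize(end_state) == normalize(expected_state)

    # The task starts from a stale artifact rather than a clean workspace.
    return TaskSpec(instruction, dict(stale_state), validator)
```

In this sketch the stale initial state fails the validator by construction, so an agent that assumes a clean workspace and never inspects the existing file cannot pass by accident.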
