TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

Zhaoyang Chu; Jiarui Hu; Xingyu Jiang; Pengyu Zou; Han Li; Chao Peng; Peter O'Hearn; Earl T. Barr; Mark Harman; Federica Sarro; He Ye

arXiv:2605.22535 · cs.AI · May 22, 2026

TL;DR

TerminalWorld is a scalable benchmark built from real-world terminal recordings. It evaluates agents on authentic workflows across 18 task categories and shows that current systems still fall short, with a best pass rate of 62.5% on the Verified subset.

Contribution

The paper introduces an automated data engine that generates a large, authentic benchmark of terminal tasks from real recordings, enabling scalable evaluation of terminal agents.

Findings

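01. The best system reaches a pass rate of only 62.5% on the Verified tasks.

02. Scores correlate only weakly (r=0.20) with existing benchmarks such as Terminal-Bench.

03. The benchmark captures real-world terminal capabilities distinct from those measured by expert-curated benchmarks.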

Abstract

We introduce TerminalWorld, a scalable data engine that automatically reverse-engineers high-fidelity evaluation tasks from "in-the-wild" terminal recordings. Processing 80,870 terminal recordings, the engine yields a full benchmark of 1,530 validated tasks, spanning 18 real-world categories, ranging from short everyday operations to workflows exceeding 50 steps, and covering 1,280 unique commands. From these, we curate a Verified subset of 200 representative, manually reviewed tasks. Comprehensive benchmarking on TerminalWorld-Verified across eight frontier models and six agents reveals that current systems still struggle with authentic terminal workflows, achieving a maximum pass rate of only 62.5%. Moreover, TerminalWorld captures real-world terminal capabilities distinct from existing expert-curated benchmarks (e.g., Terminal-Bench), with only a weak correlation to their scores…
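The two headline numbers, a pass rate and a correlation with another benchmark's scores, can be illustrated with a minimal sketch. It assumes a pass rate is the share of tasks whose checks succeed and that r denotes a Pearson coefficient over per-system scores; the source does not spell out either definition, and the function names and inputs below are hypothetical, not taken from the TerminalWorld repository.

```python
from math import sqrt


def pass_rate(outcomes):
    """Return the percentage of tasks that passed (assumed definition)."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return 100.0 * sum(1 for ok in outcomes if ok) / len(outcomes)


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Illustrative arithmetic only: on a 200-task subset, 125 passing tasks
# correspond to a 62.5% pass rate. These are not per-task results from the paper.
outcomes = [True] * 125 + [False] * 75
print(pass_rate(outcomes))

# Hypothetical per-system scores on two benchmarks (made-up numbers).
scores_a = [10.0, 20.0, 30.0, 40.0]
scores_b = [35.0, 15.0, 40.0, 20.0]
print(pearson_r(scores_a, scores_b))
```

A coefficient near 0 means that ranking systems by one benchmark says little about their ranking on the other, which is how the weak correlation reported here is usually read.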

Code & Models

Repositories

EuniAI/TerminalWorld
github

Datasets

EuniAI/TerminalWorld
dataset · 3.6k downloads
