What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design
Ivan Bercovich

TL;DR
This paper provides guidelines for creating terminal-agent benchmarks that are adversarial, difficult, and legible, emphasizing the importance of conceptual difficulty over environmental tricks to ensure reliable evaluation of language models.
Contribution
It introduces a set of principles and failure mode analyses for designing more robust and meaningful terminal-agent benchmark tasks, moving beyond prompt-like task creation.
Findings
Over 15% of popular terminal-agent tasks are reward-hackable.
Most benchmark failures stem from treating tasks like prompts rather than adversarial challenges.
Real difficulty in benchmarks is conceptual, not just environmental complexity.
Abstract
Terminal-agent benchmarks have become a primary signal for measuring the coding and system-administration capabilities of large language models. As the market for evaluation environments grows, so does the pressure to ship tasks quickly, often without thorough adversarial review of the verification logic. This paper is a guideline for writing good benchmark tasks, drawn from over a year of contributing to and reviewing tasks for Terminal Bench. Most people write benchmark tasks the way they write prompts. They shouldn't. A prompt is designed to help the agent succeed; a benchmark is designed to find out if it can. We argue that good tasks are adversarial, difficult, and legible, and that a large class of common failure modes -- AI-generated instructions, over-prescriptive specifications, clerical difficulty, oracle solutions that assume hidden knowledge, tests that validate the wrong…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
