What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design

Ivan Bercovich

arXiv:2604.28093 · cs.AI · May 1, 2026

TL;DR

This paper provides guidelines for creating terminal-agent benchmarks that are adversarial, difficult, and legible, emphasizing the importance of conceptual difficulty over environmental tricks to ensure reliable evaluation of language models.

Contribution

It introduces a set of principles and failure mode analyses for designing more robust and meaningful terminal-agent benchmark tasks, moving beyond prompt-like task creation.

Findings

01. Over 15% of popular terminal-agent tasks are reward-hackable.

02. Most benchmark failures stem from treating tasks like prompts rather than adversarial challenges.

03. Real difficulty in benchmarks is conceptual, not merely a matter of environmental complexity.

Abstract

Terminal-agent benchmarks have become a primary signal for measuring the coding and system-administration capabilities of large language models. As the market for evaluation environments grows, so does the pressure to ship tasks quickly, often without thorough adversarial review of the verification logic. This paper is a guideline for writing good benchmark tasks, drawn from over a year of contributing to and reviewing tasks for Terminal Bench. Most people write benchmark tasks the way they write prompts. They shouldn't. A prompt is designed to help the agent succeed; a benchmark is designed to find out if it can. We argue that good tasks are adversarial, difficult, and legible, and that a large class of common failure modes -- AI-generated instructions, over-prescriptive specifications, clerical difficulty, oracle solutions that assume hidden knowledge, tests that validate the wrong…
