No More, No Less: Task Alignment in Terminal Agents
Sina Mavali, David Pape, Jonathan Evertz, Samira Abedini, Devansh Srivastav, Thorsten Eisenhofer, Sahar Abdelnabi, Lea Schönherr

TL;DR
This paper introduces TAB, a benchmark that tests whether terminal agents follow the environmental cues relevant to a task while ignoring irrelevant ones, revealing a gap between task capability and task alignment.
Contribution
The paper presents TAB, a novel benchmark with 89 tasks designed to measure how well agents can distinguish relevant instructions from irrelevant information.
Findings
Current agents show high task completion but poor task alignment.
Suppressing distractors also reduces the use of necessary cues.
Agents need to interpret instructions selectively rather than blindly follow or ignore them.
Abstract
Terminal agents are increasingly capable of executing complex, long-horizon tasks autonomously from a single user prompt. To do so, they must interpret instructions encountered in the environment (e.g., README files, code comments, stack traces) and determine their relevance to the task. This creates a fundamental challenge: relevant cues must be followed to complete a task, whereas irrelevant or misleading ones must be ignored. Existing benchmarks do not capture this ability. An agent may appear capable by blindly following all instructions, or appear robust by ignoring them altogether. We introduce TAB (Task Alignment Benchmark), a suite of 89 terminal tasks derived from Terminal-Bench 2.1. Each task is intentionally underspecified, with missing information provided as a necessary cue embedded in a natural environmental artifact, alongside a plausible but irrelevant distractor.…
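The distinction the abstract draws between following everything, ignoring everything, and following only what is relevant can be illustrated with a small sketch. This is not the paper's scoring code: the names `RunResult`, `Outcome`, `classify`, and `alignment_rate`, and the assumption that each run reduces to two booleans (whether the necessary cue was used and whether the distractor was acted on), are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    """Hypothetical outcome labels; not taken from the paper."""
    ALIGNED = "aligned"            # used the cue, ignored the distractor
    OVER_FOLLOWS = "over_follows"  # used the cue but also acted on the distractor
    MISLED = "misled"              # acted on the distractor only
    OVER_IGNORES = "over_ignores"  # ignored both cue and distractor


@dataclass(frozen=True)
class RunResult:
    """Assumed per-task observation for one agent run."""
    used_cue: bool
    followed_distractor: bool


def classify(run: RunResult) -> Outcome:
    """Map one run onto the four possible cue/distractor combinations."""
    if run.used_cue and not run.followed_distractor:
        return Outcome.ALIGNED
    if run.used_cue:
        return Outcome.OVER_FOLLOWS
    if run.followed_distractor:
        return Outcome.MISLED
    return Outcome.OVER_IGNORES


def alignment_rate(runs: list[RunResult]) -> float:
    """Fraction of runs that used the cue and ignored the distractor."""
    if not runs:
        return 0.0
    aligned = sum(classify(r) is Outcome.ALIGNED for r in runs)
    return aligned / len(runs)


if __name__ == "__main__":
    runs = [
        RunResult(used_cue=True, followed_distractor=False),
        RunResult(used_cue=True, followed_distractor=True),
        RunResult(used_cue=False, followed_distractor=False),
    ]
    for r in runs:
        print(r, classify(r).value)
    print("alignment rate:", alignment_rate(runs))
```

Under this framing, an agent that follows every instruction lands in `OVER_FOLLOWS` and one that ignores every instruction lands in `OVER_IGNORES`. Neither counts as aligned, even though the first may still complete the task and the second may look robust to distractors.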
