No More, No Less: Task Alignment in Terminal Agents

Sina Mavali; David Pape; Jonathan Evertz; Samira Abedini; Devansh Srivastav; Thorsten Eisenhofer; Sahar Abdelnabi; Lea Schönherr

arXiv:2605.12233 · cs.LG · May 13, 2026

TL;DR

This paper introduces TAB, a benchmark for evaluating terminal agents' ability to selectively interpret relevant cues in complex tasks, revealing a gap between task capability and alignment.

Contribution

The paper presents TAB, a novel benchmark with 89 tasks designed to measure how well agents can distinguish relevant instructions from irrelevant information.

Findings

01 Current agents show high task completion but poor task alignment.

02 Suppressing distractors also reduces the use of necessary cues.

03 Agents need to interpret instructions selectively rather than blindly follow or ignore them.

Abstract

Terminal agents are increasingly capable of executing complex, long-horizon tasks autonomously from a single user prompt. To do so, they must interpret instructions encountered in the environment (e.g., README files, code comments, stack traces) and determine their relevance to the task. This creates a fundamental challenge: relevant cues must be followed to complete a task, whereas irrelevant or misleading ones must be ignored. Existing benchmarks do not capture this ability. An agent may appear capable by blindly following all instructions, or appear robust by ignoring them altogether. We introduce TAB (Task Alignment Benchmark), a suite of 89 terminal tasks derived from Terminal-Bench 2.1. Each task is intentionally underspecified, with missing information provided as a necessary cue embedded in a natural environmental artifact, alongside a plausible but irrelevant distractor.…
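
The distinction the abstract draws (follow the necessary cue, ignore the distractor) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Episode` record, the four outcome labels, and the `alignment_rate` helper are hypothetical names, not the paper's actual metrics or evaluation harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """Hypothetical record of one agent run on a task that contains
    exactly one necessary cue and one irrelevant distractor."""
    followed_cue: bool
    followed_distractor: bool


def classify(ep: Episode) -> str:
    """Map a run to one of four illustrative outcome labels (assumed, not from the paper)."""
    if ep.followed_cue and not ep.followed_distractor:
        return "aligned"         # did what the task needed: no more, no less
    if ep.followed_cue and ep.followed_distractor:
        return "over-compliant"  # blindly followed every instruction it saw
    if not ep.followed_cue and not ep.followed_distractor:
        return "over-cautious"   # ignored environmental instructions altogether
    return "misled"              # followed only the distractor


def alignment_rate(episodes: list[Episode]) -> float:
    """Fraction of runs labeled 'aligned'; 0.0 for an empty list."""
    if not episodes:
        return 0.0
    return sum(classify(ep) == "aligned" for ep in episodes) / len(episodes)


runs = [
    Episode(followed_cue=True, followed_distractor=False),
    Episode(followed_cue=True, followed_distractor=True),
    Episode(followed_cue=False, followed_distractor=False),
    Episode(followed_cue=False, followed_distractor=True),
]
print([classify(ep) for ep in runs])
print(alignment_rate(runs))  # 0.25
```

Under this toy scoring, an agent that follows everything and an agent that ignores everything both score zero on alignment, which is the failure mode the abstract says existing benchmarks do not capture.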
