TL;DR
This paper introduces Agentic Harness Engineering (AHE), an automated, observability-driven method for evolving coding-agent harnesses that improve performance and transferability across models.
Contribution
AHE is a closed-loop framework built on three observability pillars (component, experience, and decision observability) that enable autonomous harness evolution without resorting to trial and error.
Findings
AHE iterations improve pass@1 from 69.7% to 77.0% on Terminal-Bench 2.
Evolved harness components transfer effectively without re-evolution.
The evolved harness yields gains of +5.1 to +10.1 percentage points when applied to other model families.
Abstract
Harnesses are now central to coding-agent performance, mediating how models interact with tools and execution environments. Yet harness engineering remains a manual craft, because automating it faces a heterogeneous action space across editable components, voluminous trajectories that bury actionable signal, and edits whose effect is hard to attribute. We introduce Agentic Harness Engineering (AHE), a closed loop that addresses these challenges through three matched observability pillars: (1) component observability gives every editable harness component a file-level representation so the action space is explicit and revertible; (2) experience observability distills millions of raw trajectory tokens into a layered, drill-down evidence corpus that an evolving agent can actually consume; and (3) decision observability pairs every edit with a self-declared prediction, later verified…
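As a rough illustration of pillars (1) and (3), the sketch below pairs a file-level edit with a self-declared prediction and reverts the edit when the prediction does not hold. It is not the authors' implementation. The names `EditRecord` and `apply_or_revert`, the in-memory `files` dictionary, and the choice of a pass@1 gain as the predicted quantity are all assumptions made for this example; the abstract above is cut off before it says how verification is actually performed.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class EditRecord:
    """One harness edit paired with a self-declared prediction (hypothetical schema)."""
    component_path: str        # file-level handle for the edited component
    old_text: str              # kept so the edit stays revertible
    new_text: str
    predicted_min_gain: float  # assumed metric: pass@1 gain in percentage points
    observed_gain: Optional[float] = None

    def verify(self, baseline_pass1: float, new_pass1: float) -> bool:
        """Record the observed gain and compare it with the prediction."""
        self.observed_gain = new_pass1 - baseline_pass1
        return self.observed_gain >= self.predicted_min_gain


def apply_or_revert(files: Dict[str, str], edit: EditRecord,
                    baseline_pass1: float, measured_pass1: float) -> bool:
    """Apply the edit, then keep it only if its prediction held (assumed policy)."""
    files[edit.component_path] = edit.new_text
    kept = edit.verify(baseline_pass1, measured_pass1)
    if not kept:
        # Prediction failed: restore the previous file contents.
        files[edit.component_path] = edit.old_text
    return kept


if __name__ == "__main__":
    # Illustrative numbers only; they are not results from the paper.
    files = {"prompts/system.md": "v1"}
    edit = EditRecord("prompts/system.md", "v1", "v2", predicted_min_gain=2.0)
    print(apply_or_revert(files, edit, baseline_pass1=50.0, measured_pass1=53.0))
    print(files["prompts/system.md"])
```

Storing the old text alongside the prediction means each edit carries everything needed both to judge it and to undo it, which is one plausible reading of "explicit and revertible" in the abstract.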
