TL;DR
This paper introduces AgentLens, a framework for process-level assessment of SWE agents. It finds that 10.7% of passing trajectories are Lucky Passes that reach a passing patch through flawed behavior, which challenges outcome-only pass/fail evaluation.
Contribution
The paper presents AgentLens, a process-level evaluation framework, and AgentLens-Bench, a dataset of 1,815 annotated trajectories. Together they identify and characterize Lucky Passes in SWE-agent trajectories, giving a finer-grained assessment than pass/fail.
Findings
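10.7% of passing trajectories in the 1,815-trajectory evaluation subset are Lucky Passes: they pass the tests despite regression cycles, blind retries, missing verification, or temporally disordered exploration, implementation, and verification.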
Lucky Passes account for up to 23.2% of trajectories across models.
Quality scores differ significantly from pass/fail labels, which exposes the limits of outcome-only evaluation.
Abstract
Evaluation of software engineering (SWE) agents is dominated by a binary signal: whether the final patch passes the tests. This outcome-only view treats a principled solution and a chaotic trial-and-error process as equivalent. We show that this equivalence is empirically false. We evaluate 2,614 OpenHands trajectories from eight model backends on 60 SWE-bench Verified tasks. Of these, 47 have enough passing trajectories to construct task-level process references, yielding a 1,815-trajectory evaluation subset. Among passing trajectories in this subset, 10.7% exhibit behavior we call a Lucky Pass: regression cycles, blind retries, missing verification, or temporally disordered exploration, implementation, and verification. We introduce AgentLens, a framework for process-level assessment of SWE-agent trajectories, and release AgentLens-Bench, a dataset of 1,815 trajectories annotated…
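The subset construction and the Lucky Pass rate described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Trajectory` record, the flag names, and the `min_passing` threshold are hypothetical, since the abstract does not say how many passing trajectories count as enough or how each behavior is detected.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical flag names, one per flawed behavior listed in the abstract.
FLAGS = {
    "regression_cycle",
    "blind_retry",
    "missing_verification",
    "disordered_phases",
}


@dataclass
class Trajectory:
    """Hypothetical record for one agent run on one task."""
    task_id: str
    passed: bool
    flags: set = field(default_factory=set)


def evaluation_subset(trajectories, min_passing):
    """Keep every trajectory whose task has at least `min_passing` passing runs.

    `min_passing` is an assumed parameter; the abstract gives no threshold.
    """
    passing_per_task = defaultdict(int)
    for t in trajectories:
        if t.passed:
            passing_per_task[t.task_id] += 1
    return [t for t in trajectories if passing_per_task[t.task_id] >= min_passing]


def lucky_pass_rate(trajectories):
    """Share of passing trajectories that carry at least one process flag."""
    passing = [t for t in trajectories if t.passed]
    if not passing:
        return 0.0
    lucky = sum(1 for t in passing if t.flags & FLAGS)
    return lucky / len(passing)


# Toy data, invented for illustration only.
runs = [
    Trajectory("task-a", True),
    Trajectory("task-a", True, {"blind_retry"}),
    Trajectory("task-a", False, {"regression_cycle"}),
    Trajectory("task-b", True, {"missing_verification"}),
]
subset = evaluation_subset(runs, min_passing=2)
print(len(subset), lucky_pass_rate(subset))  # prints: 3 0.5
```

In the toy data, task-b is dropped because it has only one passing run, and the failing task-a run stays in the subset but is excluded from the rate's denominator. This mirrors how the abstract reports 10.7% among passing trajectories only.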
