AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

Priyam Sahoo; Gaurav Mittal; Xiaomin Li; Shengjie Ma; Benjamin Steenhoek; Pingping Lin; Yu Hu

arXiv:2605.12925·cs.SE·May 14, 2026

TL;DR

This paper introduces AgentLens, a framework for process-level assessment of SWE agents. It shows that 10.7% of passing trajectories in the evaluation subset are Lucky Passes, meaning they pass the tests despite flawed behavior, which challenges outcome-only pass/fail evaluation.

Contribution

The paper presents AgentLens, a process-level evaluation framework, and AgentLens-Bench, a dataset of 1,815 annotated trajectories. Together they identify and characterize Lucky Passes in SWE-agent trajectories, giving a finer-grained assessment than a pass/fail label.

Findings

01 Among passing trajectories in the 1,815-trajectory evaluation subset, 10.7% are Lucky Passes: they pass the tests despite flawed process behavior.

02 Lucky Passes account for up to 23.2% of trajectories across models.

03 Quality scores differ significantly from pass/fail labels, revealing flaws in outcome-only evaluation.

Abstract

Evaluation of software engineering (SWE) agents is dominated by a binary signal: whether the final patch passes the tests. This outcome-only view treats a principled solution and a chaotic trial-and-error process as equivalent. We show that this equivalence is empirically false. We evaluate 2,614 OpenHands trajectories from eight model backends on 60 SWE-bench Verified tasks. Of these, 47 have enough passing trajectories to construct task-level process references, yielding a 1,815-trajectory evaluation subset. Among passing trajectories in this subset, 10.7% exhibit behavior we call a Lucky Pass: regression cycles, blind retries, missing verification, or temporally disordered exploration, implementation, and verification. We introduce AgentLens, a framework for process-level assessment of SWE-agent trajectories, and release AgentLens-Bench, a dataset of 1,815 trajectories annotated…
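To make the four Lucky Pass behaviors named in the abstract concrete, here is a minimal sketch of how such flags might be computed over a trajectory. It is not the AgentLens method: the `Step` record, its `kind` labels ("explore", "edit", "test"), the `tests_passed` field, and each heuristic below are hypothetical simplifications chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One agent action. All fields are hypothetical, not the AgentLens schema."""
    kind: str              # assumed labels: "explore", "edit", or "test"
    action: str            # e.g. the command run or the edit applied
    tests_passed: int = 0  # only meaningful for "test" steps


def process_flags(steps: List[Step]) -> List[str]:
    """Return the names of toy process flaws found in a trajectory."""
    flags = []
    tests = [s for s in steps if s.kind == "test"]
    edit_idx = [i for i, s in enumerate(steps) if s.kind == "edit"]

    # Regression cycle: a test run passes fewer tests than the run before it.
    if any(b.tests_passed < a.tests_passed for a, b in zip(tests, tests[1:])):
        flags.append("regression_cycle")

    # Blind retry: the same action is repeated in two consecutive steps.
    if any(a.kind == b.kind and a.action == b.action
           for a, b in zip(steps, steps[1:])):
        flags.append("blind_retry")

    # Missing verification: no test step follows the last edit.
    if edit_idx and not any(s.kind == "test" for s in steps[edit_idx[-1] + 1:]):
        flags.append("missing_verification")

    # Temporal disorder: the first edit happens before any exploration.
    first_explore = next(
        (i for i, s in enumerate(steps) if s.kind == "explore"), None)
    if edit_idx and (first_explore is None or edit_idx[0] < first_explore):
        flags.append("temporal_disorder")

    return flags


def is_lucky_pass(steps: List[Step], final_patch_passes: bool) -> bool:
    """A passing trajectory counts as a Lucky Pass here if any flag fires."""
    return final_patch_passes and bool(process_flags(steps))
```

Under these assumptions, a trajectory that explores, edits once, and then runs the tests raises no flags, while one that edits before exploring, loses passing tests between runs, repeats the same edit, and never re-runs the tests raises all four. A failing trajectory is never a Lucky Pass in this sketch, whatever flags it raises.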

Code & Models

Repositories

microsoft/code-agent-state-trajectories (GitHub)
