TRAJEVAL: Decomposing Code Agent Trajectories for Fine-Grained Diagnosis

Myeongsoo Kim; Dingmin Wang; Siwei Cui; Farima Farmahinifarahani; Shweta Garg; Baishakhi Ray; Terry Yue Zhuo; Rajdeep Mukherjee; and Varun Kumar

arXiv:2603.24631 · cs.SE · March 27, 2026

TL;DR

TRAJEVAL is a diagnostic framework that decomposes code agent trajectories into interpretable stages, revealing inefficiencies and failure modes, and enabling targeted improvements in agent performance.

Contribution

TRAJEVAL introduces a fine-grained diagnostic method for analyzing code agent failures. It moves beyond binary pass/fail metrics toward interpretable, stage-level measurements that yield actionable insights.

Findings

01. All agents examine ~22x more functions than necessary

02. GPT-5 locates relevant code but targets edits incorrectly

03. Qwen-32B fails at file discovery entirely

Abstract

Code agents can autonomously resolve GitHub issues, yet when they fail, current evaluation provides no visibility into where or why. Metrics such as Pass@1 collapse an entire execution into a single binary outcome. To address this limitation, we introduce TRAJEVAL, a diagnostic framework that decomposes agent trajectories into three interpretable stages: search (file localization), read (function comprehension), and edit (modification targeting). For each stage, we compute precision and recall by comparing against reference patches. Analyzing 16,758 trajectories across three agent architectures and seven models, we find universal inefficiencies (all agents examine approximately 22x more functions than necessary) yet distinct failure modes: GPT-5 locates relevant code but targets edits incorrectly, while Qwen-32B fails…
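
The per-stage precision and recall described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `score_stage`, the `StageScore` type, the file and function names, and the choice to score an empty set as 0.0 are all assumptions made for illustration. The abstract does not state the granularity used for the edit stage, so treating it as a set of functions is also an assumption. Each stage is modeled as a set of items the agent touched, compared against the set implied by the reference patch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageScore:
    precision: float
    recall: float


def score_stage(touched: set[str], reference: set[str]) -> StageScore:
    """Precision/recall of what the agent touched versus the reference patch.

    Assumption: an empty set scores 0.0 rather than being undefined.
    """
    hits = len(touched & reference)
    precision = hits / len(touched) if touched else 0.0
    recall = hits / len(reference) if reference else 0.0
    return StageScore(precision, recall)


# Hypothetical trajectory: items the agent touched in each stage.
trajectory = {
    "search": {"pkg/parser.py", "pkg/utils.py", "tests/test_parser.py", "docs/conf.py"},
    "read": {"pkg/parser.py::parse", "pkg/parser.py::tokenize", "pkg/utils.py::strip"},
    "edit": {"pkg/parser.py::tokenize"},
}

# Hypothetical reference patch: the file and function it actually modifies.
reference = {
    "search": {"pkg/parser.py"},
    "read": {"pkg/parser.py::parse"},
    "edit": {"pkg/parser.py::parse"},
}

scores = {stage: score_stage(trajectory[stage], reference[stage]) for stage in trajectory}

for stage, s in scores.items():
    print(f"{stage}: precision={s.precision:.2f} recall={s.recall:.2f}")
```

In this made-up example the agent finds the right file and reads the right function (recall of 1.0 in both stages) but edits a different function, so edit precision and recall are both 0. That is the kind of pattern a single pass/fail outcome would hide, and low read precision is how over-examination of functions would show up in such a score.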

Taxonomy

Topics: Software Engineering Research · Software Testing and Debugging Techniques · Software System Performance and Reliability