Reproducible, Explainable, and Effective Evaluations of Agentic AI for Software Engineering
Jingyue Li, André Storhaug

TL;DR
This paper reviews how Agentic AI is currently evaluated in software engineering, highlights the limitations of those practices, and proposes guidelines for reproducible, explainable, and effective assessments, including sharing Thought-Action-Result (TAR) trajectories and LLM interaction data.
Contribution
It introduces a set of guidelines and a proof-of-concept case study to improve evaluation transparency and comparability of Agentic AI in SE.
Findings
Current evaluations often lack reproducibility and transparency.
Sharing TAR trajectories enhances analysis of Agentic AI approaches.
Guidelines can improve the quality and comparability of future evaluations.
Abstract
With the advancement of Agentic AI, researchers are increasingly leveraging autonomous agents to address challenges in software engineering (SE). However, the large language models (LLMs) that underpin these agents often function as black boxes, making it difficult to justify the superiority of Agentic AI approaches over baselines. Furthermore, missing information in the evaluation design description frequently renders the reproduction of results infeasible. To synthesize current evaluation practices for Agentic AI in SE, this study analyzes 18 papers on the topic, published or accepted by ICSE 2026, ICSE 2025, FSE 2025, ASE 2025, and ISSTA 2025. The analysis identifies prevailing approaches and their limitations in evaluating Agentic AI for SE, both in current research and potential future studies. To address these shortcomings, this position paper proposes a set of guidelines and…
