Log analysis is necessary for credible evaluation of AI agents
Peter Kirgis, Sayash Kapoor, Stephan Rabanser, Nitya Nadgir, Cozmin Ududec, Magda Dubois, JJ Allaire, Conrad Stosz, Marius Hobbhahn, Jacob Steinhardt, and Arvind Narayanan

TL;DR
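This paper argues that systematic log analysis is needed to make AI agent evaluation credible. It targets three threats to validity: scores distorted by shortcuts and benchmark artifacts, benchmark results that fail to predict real-world utility, and dangerous actions hidden behind capability scores.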
Contribution
It introduces a taxonomy of threats to credible evaluation, develops guiding principles for log analysis, and illustrates those principles on tau-Bench Airline, where log analysis reveals performance issues that outcome metrics miss.
Findings
Pass performance was under-elicited by nearly 50%.
Log analysis revealed deployment failure modes invisible to outcome metrics.
The paper offers guidelines for stakeholders adopting log analysis in their evaluation processes.
Abstract
Agent benchmarks typically report only final outcomes: pass or fail. This threatens evaluation credibility in three ways. First, scores may be inflated or deflated by shortcuts and benchmark artifacts, misrepresenting capability. Second, benchmark performance may fail to predict real-world utility due to scaffold limitations and recurring failure modes. Finally, capability scores may conceal dangerous or catastrophic actions taken by the agent. We argue that log analysis -- the systematic tracking and analysis of the inputs, execution, and outputs of an AI agent -- is necessary to overcome these validity threats and promote credible agent evaluation. In this paper, we (1) present a taxonomy of threats to credible evaluation documented through log analysis, and (2) develop a set of guiding principles for log analysis. We illustrate these principles on tau-Bench Airline, revealing that…
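As a rough illustration of the gap the abstract describes between final outcomes and what an agent actually did, the sketch below compares an outcome-only pass rate with a simple scan of per-step logs. It is not the paper's method or tooling. The log schema (`Step`, `EpisodeLog`), the action names, the `FLAGGED_ACTIONS` set, and the three toy episodes are all hypothetical and made up for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Step:
    action: str  # hypothetical tool/action name recorded during execution


@dataclass
class EpisodeLog:
    task_id: str
    passed: bool  # final outcome, as an outcome-only benchmark would report it
    steps: List[Step] = field(default_factory=list)


# Hypothetical actions a reviewer wants surfaced regardless of the outcome.
FLAGGED_ACTIONS = {"delete_record", "issue_refund"}


def pass_rate(logs: List[EpisodeLog]) -> float:
    """Outcome-only metric: fraction of episodes marked as passed."""
    return sum(log.passed for log in logs) / len(logs)


def flagged_episodes(logs: List[EpisodeLog]) -> Dict[str, List[str]]:
    """Map task_id to the flagged actions found in that episode's steps."""
    hits: Dict[str, List[str]] = {}
    for log in logs:
        found = [s.action for s in log.steps if s.action in FLAGGED_ACTIONS]
        if found:
            hits[log.task_id] = found
    return hits


# Toy, made-up episodes: t2 passes but contains a flagged action.
logs = [
    EpisodeLog("t1", True, [Step("lookup_booking"), Step("update_booking")]),
    EpisodeLog("t2", True, [Step("lookup_booking"), Step("delete_record"),
                            Step("update_booking")]),
    EpisodeLog("t3", False, [Step("lookup_booking")]),
]

print(f"pass rate: {pass_rate(logs):.2f}")
print("flagged:", flagged_episodes(logs))
```

In this toy data, the pass rate counts episode `t2` as a success, while the log scan shows that it contained a flagged action. A real analysis would need a much richer log format and human or model review rather than a fixed action list.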
