Evaluation and Benchmarking of LLM Agents: A Survey
Mahmoud Mohammadi, Yipeng Li, Jane Lo, Wendy Yip

TL;DR
This survey reviews the current landscape of LLM agent evaluation, proposes a taxonomy, and highlights open challenges and future directions toward systematic, realistic assessment methods.
Contribution
It introduces a two-dimensional taxonomy for LLM agent evaluation and discusses enterprise-specific challenges and future research directions.
Findings
Proposes a taxonomy organizing evaluation objectives and processes.
Highlights enterprise challenges such as role-based data access, reliability guarantees, and compliance.
Identifies future needs for holistic and scalable evaluation.
Abstract
The rise of LLM-based agents has opened new frontiers in AI applications, yet evaluating these agents remains a complex and underdeveloped area. This survey provides an in-depth overview of the emerging field of LLM agent evaluation, introducing a two-dimensional taxonomy that organizes existing work along (1) evaluation objectives -- what to evaluate, such as agent behavior, capabilities, reliability, and safety -- and (2) evaluation process -- how to evaluate, including interaction modes, datasets and benchmarks, metric computation methods, and tooling. In addition to taxonomy, we highlight enterprise-specific challenges, such as role-based access to data, the need for reliability guarantees, dynamic and long-horizon interactions, and compliance, which are often overlooked in current research. We also identify future research directions, including holistic, more realistic, and…
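One way to picture the two-dimensional taxonomy from the abstract is as a grid whose rows are evaluation objectives and whose columns are aspects of the evaluation process. The Python sketch below is only an illustration under that assumption: the names `Objective`, `ProcessAspect`, `taxonomy_grid`, and `cells_covered` are hypothetical and not taken from the paper, and the enum members list only the examples named in the abstract, which the authors introduce with "such as" and "including", so the actual taxonomy may contain more entries.

```python
from enum import Enum


class Objective(Enum):
    """What to evaluate (examples named in the abstract; likely not exhaustive)."""
    BEHAVIOR = "agent behavior"
    CAPABILITIES = "capabilities"
    RELIABILITY = "reliability"
    SAFETY = "safety"


class ProcessAspect(Enum):
    """How to evaluate (examples named in the abstract; likely not exhaustive)."""
    INTERACTION_MODES = "interaction modes"
    DATASETS_AND_BENCHMARKS = "datasets and benchmarks"
    METRIC_COMPUTATION = "metric computation methods"
    TOOLING = "tooling"


def taxonomy_grid():
    """Return every (objective, process aspect) cell of the hypothetical grid."""
    return [(o, p) for o in Objective for p in ProcessAspect]


def cells_covered(objectives, aspects):
    """Return the grid cells touched by a work that addresses the given
    objectives using the given process aspects."""
    return {(o, p) for o in objectives for p in aspects}


if __name__ == "__main__":
    # Hypothetical example: a safety benchmark that ships its own tooling.
    covered = cells_covered(
        {Objective.SAFETY},
        {ProcessAspect.DATASETS_AND_BENCHMARKS, ProcessAspect.TOOLING},
    )
    for objective, aspect in sorted(covered, key=lambda c: c[1].value):
        print(f"{objective.value} x {aspect.value}")
```

Treating each surveyed work as a set of cells makes it easy to see which combinations of objective and process are well covered and which are sparse, though how the survey itself maps works onto the taxonomy is not specified in this summary.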
