Evolutionary Perspectives on the Evaluation of LLM-Based AI Agents: A Comprehensive Survey

Jiachen Zhu; Menghui Zhu; Renting Rui; Rong Shan; Congmin Zheng; Bo Chen; Yunjia Xi; Jianghao Lin; Weiwen Liu; Ruiming Tang; Yong Yu; Weinan Zhang

arXiv:2506.11102·cs.CL·June 16, 2025

Open Access

TL;DR

This survey systematically analyzes evaluation methods for LLM-based AI agents, distinguishing them from chatbots and providing a comprehensive framework to guide future research and benchmarking practices.

Contribution

The paper introduces an analytical framework that differentiates AI agents from LLM chatbots and categorizes evaluation benchmarks by external environment and internal capability.

Findings

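01. Differentiates AI agents from LLM chatbots along five aspects: complex environment, multi-source instructor, dynamic feedback, multi-modal perception, and advanced capability.

02. Categorizes evaluation benchmarks by external (environmental) factors and internal capabilities.

03. Outlines future directions for evaluation methodologies.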

Abstract

The advent of large language models (LLMs), such as GPT, Gemini, and DeepSeek, has significantly advanced natural language processing, giving rise to sophisticated chatbots capable of diverse language-related tasks. The transition from these traditional LLM chatbots to more advanced AI agents represents a pivotal evolutionary step. However, existing evaluation frameworks often blur the distinctions between LLM chatbots and AI agents, leading to confusion among researchers selecting appropriate benchmarks. To bridge this gap, this paper introduces a systematic analysis of current evaluation approaches, grounded in an evolutionary perspective. We provide a detailed analytical framework that clearly differentiates AI agents from LLM chatbots along five key aspects: complex environment, multi-source instructor, dynamic feedback, multi-modal perception, and advanced capability. Further, we…

Taxonomy

Topics: Statistical and Computational Modeling

Methods: Dropout · Attention Dropout · Cosine Annealing · Linear Warmup With Cosine Annealing · Discriminative Fine-Tuning · Byte Pair Encoding · Layer Normalization · Dense Connections · Softmax