AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents
Luca Gioacchini, Giuseppe Siracusano, Davide Sanvito, Kiril Gashteovski, David Friede, Roberto Bifulco, Carolin Lawrence

TL;DR
AgentQuest is a modular benchmarking framework for LLM agents that introduces new evaluation metrics and facilitates progress tracking and architecture refinement in multi-step reasoning tasks.
Contribution
It provides a flexible, extensible platform with novel metrics to reliably measure and improve LLM agent performance.
Findings
New metrics effectively track agent progress.
Framework helps identify failure points and improve architectures.
Refining the agent architecture at those failure points yields a significant performance increase.
Abstract
The advances made by Large Language Models (LLMs) have led to the pursuit of LLM agents that can solve intricate, multi-step reasoning tasks. As with any research pursuit, benchmarking and evaluation are key cornerstones of efficient and reliable progress. However, existing benchmarks are often narrow and simply compute overall task success. To address these issues, we propose AgentQuest -- a framework where (i) both benchmarks and metrics are modular and easily extensible through well-documented and easy-to-use APIs; (ii) we offer two new evaluation metrics that can reliably track LLM agent progress while solving a task. We exemplify the utility of the metrics on two use cases wherein we identify common failure points and refine the agent architecture to obtain a significant performance increase. Together with the research community, we hope to extend AgentQuest further and therefore we…
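The abstract contrasts a single end-of-task success score with metrics that follow an agent while it is still solving the task. The text above does not define the two metrics, so the following is only a minimal sketch of the general idea under stated assumptions: the class `StepTracker`, its methods, and the two measures (fraction of sub-goals reached, fraction of repeated actions) are hypothetical illustrations and are not the AgentQuest API.

```python
from dataclasses import dataclass, field
from typing import Hashable, Iterable, List, Set


@dataclass
class StepTracker:
    """Hypothetical per-episode tracker (illustration only, not AgentQuest)."""

    milestones: Set[Hashable]  # assumed sub-goals that make up the task
    reached: Set[Hashable] = field(default_factory=set)
    actions: List[Hashable] = field(default_factory=list)
    repeated: int = 0

    def record(self, action: Hashable, hit: Iterable[Hashable] = ()) -> None:
        """Log one agent step and any milestones it reached."""
        if action in self.actions:
            self.repeated += 1  # the agent re-issued an earlier action
        self.actions.append(action)
        self.reached |= set(hit) & self.milestones

    def progress(self) -> float:
        """Fraction of milestones reached so far (0.0 if none are defined)."""
        if not self.milestones:
            return 0.0
        return len(self.reached) / len(self.milestones)

    def repetition(self) -> float:
        """Fraction of logged actions that repeat an earlier action."""
        if not self.actions:
            return 0.0
        return self.repeated / len(self.actions)


# Usage: three steps, two of four milestones reached, one repeated action.
tracker = StepTracker(milestones={"a", "b", "c", "d"})
tracker.record("open door", hit=["a"])
tracker.record("open door")
tracker.record("take key", hit=["b"])
print(tracker.progress(), tracker.repetition())
```

Unlike a binary success flag, both values can be read after every step, so an agent that stalls or loops shows a flat progress curve and a rising repetition fraction long before the episode ends.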
Taxonomy
Topics: Business Process Modeling and Analysis · Multi-Agent Systems and Negotiation · Semantic Web and Ontologies
