AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents

Luca Gioacchini; Giuseppe Siracusano; Davide Sanvito; Kiril Gashteovski; David Friede; Roberto Bifulco; Carolin Lawrence

arXiv:2404.06411·cs.AI·April 10, 2024·2 cites

TL;DR

AgentQuest is a modular benchmarking framework for LLM agents that introduces new evaluation metrics and facilitates progress tracking and architecture refinement in multi-step reasoning tasks.

Contribution

It provides a flexible, extensible platform with novel metrics to reliably measure and improve LLM agent performance.

Findings

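01. Two new metrics track agent progress while a task is being solved.

02. The framework helps identify common failure points and refine agent architectures.

03. Refining the agent architecture at those failure points yields a significant performance increase in the two use cases studied.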

Abstract

The advances made by Large Language Models (LLMs) have led to the pursuit of LLM agents that can solve intricate, multi-step reasoning tasks. As with any research pursuit, benchmarking and evaluation are key cornerstones of efficient and reliable progress. However, existing benchmarks are often narrow and simply compute overall task success. To address these issues, we propose AgentQuest -- a framework where (i) both benchmarks and metrics are modular and easily extensible through well-documented and easy-to-use APIs; (ii) we offer two new evaluation metrics that can reliably track LLM agent progress while solving a task. We exemplify the utility of the metrics on two use cases wherein we identify common failure points and refine the agent architecture to obtain a significant performance increase. Together with the research community, we hope to extend AgentQuest further and therefore we…
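The abstract contrasts a single overall success score with metrics that follow an agent while it solves a task. As an illustration only, the sketch below shows one way such step-level tracking could look: a progress measure based on milestones reached and a measure of repeated actions. The function names, the milestone representation, and the exact formulas are assumptions made for this example; they are not taken from the AgentQuest API or from the paper's definitions.

```python
from typing import Hashable, Sequence, Set


def progress_rate(reached: Set[Hashable], milestones: Set[Hashable]) -> float:
    """Fraction of the task's milestones reached so far (hypothetical definition)."""
    if not milestones:
        raise ValueError("milestones must be non-empty")
    return len(reached & milestones) / len(milestones)


def repetition_rate(actions: Sequence[Hashable]) -> float:
    """Fraction of steps that repeat an action taken earlier (hypothetical definition)."""
    if not actions:
        return 0.0
    seen = set()
    repeats = 0
    for action in actions:
        if action in seen:
            repeats += 1
        seen.add(action)
    return repeats / len(actions)


if __name__ == "__main__":
    # Made-up trace: the agent has hit 2 of 4 milestones and repeated "open" twice.
    print(progress_rate({"a", "b"}, {"a", "b", "c", "d"}))
    print(repetition_rate(["open", "look", "open", "take", "open"]))
```

Unlike a binary success flag, both values can be recomputed after every step, so a run that stalls or loops shows up as flat progress together with rising repetition.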

Code & Models

Repositories

nec-research/agentquest (Official)

Videos

AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents

Taxonomy

Topics: Business Process Modeling and Analysis · Multi-Agent Systems and Negotiation · Semantic Web and Ontologies