Towards a Realistic Long-Term Benchmark for Open-Web Research Agents

Peter Mühlbacher; Nikos I. Bosse; Lawrence Phillips

arXiv:2409.14913 · cs.CL · September 26, 2024

TL;DR

This paper presents initial results from a forthcoming benchmark that evaluates large language model agents on real-world, economically valuable open-web research tasks, and compares how several agent architectures and models perform on analyst-style research.

Contribution

The paper lays the groundwork for a long-term, realistic benchmark of open-web research tasks, evaluates several agent architectures across five LLMs, and emphasizes the importance of subtask delegation and real-world applicability.

Findings

01. Claude-3.5 Sonnet and o1-preview outperformed GPT-4o-based agents

02. ReAct architecture with subtask delegation performed best

03. First in-depth assessment of LLM agents on open-web, economically valuable research

Abstract

We present initial results of a forthcoming benchmark for evaluating LLM agents on white-collar tasks of economic value. We evaluate agents on real-world "messy" open-web research tasks of the type that are routine in finance and consulting. In doing so, we lay the groundwork for an LLM agent evaluation suite where good performance directly corresponds to a large economic and societal impact. We built and tested several agent architectures with o1-preview, GPT-4o, Claude-3.5 Sonnet, Llama 3.1 (405b), and GPT-4o-mini. On average, LLM agents powered by Claude-3.5 Sonnet and o1-preview substantially outperformed agents using GPT-4o, with agents based on Llama 3.1 (405b) and GPT-4o-mini lagging noticeably behind. Across LLMs, a ReAct architecture with the ability to delegate subtasks to subagents performed best. In addition to quantitative evaluations, we qualitatively assessed the…

Taxonomy

Topics: Semantic Web and Ontologies · Multi-Agent Systems and Negotiation · Scientific Computing and Data Management

Methods: LLaMA