Towards a Realistic Long-Term Benchmark for Open-Web Research Agents
Peter Mühlbacher, Nikos I. Bosse, Lawrence Phillips

TL;DR
This paper introduces a benchmark for evaluating large language model agents on real-world, economically valuable open-web research tasks, and compares how several agent architectures and underlying models perform on analyst-style research.
Contribution
It presents initial results from a forthcoming benchmark of long-term, realistic open-web research tasks and evaluates several agent architectures across multiple LLMs, emphasizing the value of subtask delegation and real-world applicability.
Findings
Claude-3.5 Sonnet and o1-preview outperformed GPT-4o-based agents
ReAct architecture with subtask delegation performed best
First in-depth assessment of LLM agents on open-web, economically valuable research
Abstract
We present initial results of a forthcoming benchmark for evaluating LLM agents on white-collar tasks of economic value. We evaluate agents on real-world "messy" open-web research tasks of the type that are routine in finance and consulting. In doing so, we lay the groundwork for an LLM agent evaluation suite where good performance directly corresponds to a large economic and societal impact. We built and tested several agent architectures with o1-preview, GPT-4o, Claude-3.5 Sonnet, Llama 3.1 (405b), and GPT-4o-mini. On average, LLM agents powered by Claude-3.5 Sonnet and o1-preview substantially outperformed agents using GPT-4o, with agents based on Llama 3.1 (405b) and GPT-4o-mini lagging noticeably behind. Across LLMs, a ReAct architecture with the ability to delegate subtasks to subagents performed best. In addition to quantitative evaluations, we qualitatively assessed the…
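The abstract names a ReAct architecture that can delegate subtasks to subagents as the best-performing setup. The following is a minimal sketch of that general pattern, not the authors' implementation: the function `react_agent`, the action names (`search`, `delegate`, `finish`), the scripted policy standing in for an LLM call, and the stub search tool are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Tuple

# An action is (kind, argument). The kinds used here are illustrative
# assumptions: "search" calls a tool, "delegate" spawns a subagent,
# "finish" returns the final answer.
Action = Tuple[str, str]
Policy = Callable[[str, List[str]], Action]


def react_agent(task: str, policy: Policy,
                tools: Dict[str, Callable[[str], str]],
                depth: int = 0, max_depth: int = 1,
                max_steps: int = 5) -> str:
    """Run a reason-act-observe loop; 'delegate' recurses into a subagent."""
    observations: List[str] = []
    for _ in range(max_steps):
        kind, arg = policy(task, observations)  # "reason": pick next action
        if kind == "finish":
            return arg
        if kind == "delegate" and depth < max_depth:
            # The subagent runs the same loop on the narrower subtask.
            result = react_agent(arg, policy, tools,
                                 depth + 1, max_depth, max_steps)
        elif kind in tools:
            result = tools[kind](arg)  # "act": call the named tool
        else:
            result = f"unavailable action: {kind}"
        observations.append(result)  # "observe": feed result back
    return "no answer within step budget"


def scripted_policy(task: str, observations: List[str]) -> Action:
    """Hypothetical stand-in for an LLM call, hard-coded for this demo."""
    if task.startswith("sub:"):
        if not observations:
            return ("search", task[4:])
        return ("finish", observations[-1])
    if not observations:
        return ("delegate", "sub:revenue of ExampleCorp")
    return ("finish", f"report based on: {observations[-1]}")


def fake_search(query: str) -> str:
    """Stub tool; a real agent would query the open web here."""
    return f"stub result for '{query}'"


answer = react_agent("Analyse ExampleCorp", scripted_policy,
                     {"search": fake_search})
print(answer)
```

In this sketch the depth limit keeps subagents from delegating further, and the step budget bounds each loop; both limits are assumptions of the example rather than details taken from the paper.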
Taxonomy
Topics: Semantic Web and Ontologies · Multi-Agent Systems and Negotiation · Scientific Computing and Data Management
Methods: LLaMA
