Benchmarking LLM Agents for Wealth-Management Workflows
Rory Milsom

TL;DR
This paper develops a benchmark for evaluating large language model agents on wealth-management workflows and highlights how workflow reliability and autonomy level affect performance on assistant-level financial tasks.
Contribution
It introduces a comprehensive benchmark with synthetic data, explicit criteria, and autonomy variants to assess LLM agents' performance in wealth management tasks.
Findings
Agents are limited more by workflow reliability than reasoning.
Autonomy level significantly impacts agent performance.
Incorrect model evaluations have hindered benchmarking efforts.
Abstract
Modern work relies on an assortment of digital collaboration tools, yet routine processes continue to suffer from human error and delay. To address this gap, this dissertation extends TheAgentCompany with a finance-focused environment and investigates whether a general-purpose LLM agent can complete representative wealth-management tasks both accurately and economically. This study introduces synthetic domain data, enriches colleague simulations, and prototypes an automatic task-generation pipeline. The study aims to create and assess an evaluation set that can meaningfully measure an agent's fitness for assistant-level wealth-management work. We construct a benchmark of 12 task-pairs for wealth management assistants spanning retrieval, analysis, and synthesis/communication, with explicit acceptance criteria and deterministic graders. We seeded a set of new finance-specific data and…
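The abstract mentions explicit acceptance criteria and deterministic graders. As a rough illustration of what that combination can look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Criterion` class, the `grade` and `passed` helpers, the file name `report.csv`, and the client ID `C-001` are hypothetical and are not taken from the paper or from TheAgentCompany.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# Hypothetical structure: all names and checks here are illustrative only.
@dataclass(frozen=True)
class Criterion:
    name: str
    # A pure function of the agent's output files (filename -> content).
    check: Callable[[Dict[str, str]], bool]


def grade(output: Dict[str, str], criteria: List[Criterion]) -> Dict[str, bool]:
    """Apply every criterion to the output.

    The checks use no randomness and no model calls, so the same output
    always produces the same result.
    """
    return {c.name: c.check(output) for c in criteria}


def passed(results: Dict[str, bool]) -> bool:
    """A task passes only if every acceptance criterion is met."""
    return all(results.values())


# Assumed example task: the agent must write a report that names a client.
criteria = [
    Criterion("report_file_created", lambda out: "report.csv" in out),
    Criterion(
        "mentions_client_id",
        lambda out: "C-001" in out.get("report.csv", ""),
    ),
]

agent_output = {"report.csv": "client_id,total\nC-001,1500.00\n"}
results = grade(agent_output, criteria)
```

Keeping each criterion as a named, pure check makes a failed run easy to diagnose, because the result records which criterion was not met rather than a single pass/fail flag.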
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Business Process Modeling and Analysis · Artificial Intelligence in Law
