WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance

Thomson Yen; Julian Poeltl; Harshith Srinivas Gear; Yilin Meng; Joshua Fan; Adam Shen; Yili Liu; Ali Bauyrzhan; Siri Du; Haoyang Liu; Daniel Guetta; and Hongseok Namkoong

arXiv:2605.22664·cs.AI·May 22, 2026

TL;DR

This paper introduces WorkstreamBench, a benchmark for evaluating LLM agents on complex, end-to-end spreadsheet tasks in finance, highlighting current limitations in producing professional-quality outputs.

Contribution

The paper presents one of the first comprehensive evaluations of LLM agents on real-world financial spreadsheet workflows, together with a new multidimensional assessment taxonomy.

Findings

01. Claude family agents produce the most professional-looking spreadsheets

02. Agents often fall short of professional finance standards

03. Performance degrades sharply with increased complexity

Abstract

LLM agents are increasingly expected to carry out end-to-end workflows, producing complete artifacts from high-level user instructions. To meet enterprise needs, frontier AI labs have developed agents that can construct entire spreadsheets from scratch. This is especially relevant in finance, where core workflows such as financial modeling, forecasting, and scenario analysis are commonly conducted through spreadsheets. Yet, existing spreadsheet benchmarks do not measure this advanced capability, focusing instead on question-answering or single-formula edits. To address this gap, we provide one of the first evaluations of agents on end-to-end spreadsheet tasks, focusing on economically critical financial workflows such as modeling and scenario analysis. Since deliverables therein are routinely reviewed and revised by multiple stakeholders, judging their quality necessarily involves…
