WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance
Thomson Yen, Julian Poeltl, Harshith Srinivas Gear, Yilin Meng, Joshua Fan, Adam Shen, Yili Liu, Ali Bauyrzhan, Siri Du, Haoyang Liu, Daniel Guetta, and Hongseok Namkoong

TL;DR
This paper introduces WorkstreamBench, a benchmark for evaluating LLM agents on complex, end-to-end spreadsheet tasks in finance, highlighting current limitations in producing professional-quality outputs.
Contribution
It presents one of the first comprehensive evaluations of LLM agents on real-world financial spreadsheet workflows, with a new multidimensional assessment taxonomy.
Findings
Claude-family agents produce the most professional-looking spreadsheets
Agents often fall short of professional finance standards
Performance degrades sharply with increased complexity
Abstract
LLM agents are increasingly expected to carry out end-to-end workflows, producing complete artifacts from high-level user instructions. To meet enterprise needs, frontier AI labs have developed agents that can construct entire spreadsheets from scratch. This is especially relevant in finance, where core workflows such as financial modeling, forecasting, and scenario analysis are commonly conducted through spreadsheets. Yet, existing spreadsheet benchmarks do not measure this advanced capability, focusing instead on question-answering or single-formula edits. To address this gap, we provide one of the first evaluations of agents on end-to-end spreadsheet tasks, focusing on economically critical financial workflows such as modeling and scenario analysis. Since deliverables therein are routinely reviewed and revised by multiple stakeholders, judging their quality necessarily involves…
