Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows
Haoyu Dong, Pengkun Zhang, Yan Gao, Xuanyu Dong, Yilin Cheng, Mingzhe Lu, Zikun Zhu, Adina Yakefu, Shuxin Zheng

TL;DR
Finch is a comprehensive benchmark derived from authentic enterprise finance and accounting workflows, designed to evaluate AI agents' ability to handle complex, multimodal, real-world tasks.
Contribution
The paper introduces a new benchmark dataset from real enterprise environments, combining workflow construction, expert annotation, and evaluation of advanced AI systems.
Findings
GPT-5.1 passes 38.4% of workflows in human evaluation.
The benchmark includes 172 workflows with 384 tasks and 27 million spreadsheet cells.
Real-world enterprise workflows pose significant challenges for current AI agents.
Abstract
We introduce FinWorkBench (a.k.a. Finch) for evaluating AI agents on real-world, enterprise-grade finance and accounting workflows that interleave data entry, structuring, formatting, web search, cross-file retrieval, calculation, modeling, validation, translation, visualization, and reporting. Finch is sourced from authentic enterprise workspaces from Enron (15,000 files and 500,000 emails) and other financial institutions, covering the period 2000–2025 and preserving the in-the-wild messiness of multimodal artifacts such as tables and charts across diverse domains including budgeting, trading, asset management, and operational management. We propose a workflow construction process that combines LLM-assisted mining of workflows from authentic enterprise environments with expert annotation: (1) LLM-assisted, expert-verified derivation of workflows from real-world email threads and…
