Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows

Haoyu Dong; Pengkun Zhang; Yan Gao; Xuanyu Dong; Yilin Cheng; Mingzhe Lu; Zikun Zhu; Adina Yakefu; Shuxin Zheng

arXiv:2512.13168·cs.AI·April 17, 2026

TL;DR

Finch is a comprehensive benchmark derived from authentic enterprise finance and accounting workflows, designed to evaluate AI agents' ability to handle complex, multimodal, real-world tasks.

Contribution

The paper introduces a benchmark built from real enterprise workspaces, pairing LLM-assisted workflow mining with expert annotation, and uses it to evaluate advanced AI systems.

Findings

01

GPT-5.1 passes 38.4% of workflows in human evaluation.

02

The benchmark includes 172 workflows with 384 tasks and 27 million spreadsheet cells.

03

Real-world enterprise workflows pose significant challenges for current AI agents.
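To put the headline figures in proportion, the following is a minimal sketch that derives per-workflow averages from the numbers quoted in the findings. The constant and function names are hypothetical, "27 million" is treated as a round approximation, and the implied count of passed workflows assumes the 38.4% pass rate was measured over all 172 workflows, which this summary does not confirm.

```python
# Figures as quoted in the findings above; the names are illustrative only.
WORKFLOWS = 172
TASKS = 384
CELLS = 27_000_000   # "27 million" is assumed to be a rounded figure
PASS_RATE = 0.384    # GPT-5.1, human evaluation


def summarize(workflows: int, tasks: int, cells: int, pass_rate: float) -> dict:
    """Derive simple per-workflow averages from the reported totals."""
    return {
        "tasks_per_workflow": tasks / workflows,
        "cells_per_workflow": cells / workflows,
        # Assumption: the pass rate is taken over all workflows.
        "implied_passed_workflows": round(pass_rate * workflows),
    }


summary = summarize(WORKFLOWS, TASKS, CELLS, PASS_RATE)

if __name__ == "__main__":
    for name, value in summary.items():
        print(f"{name}: {value:,.2f}")
```

Under these assumptions a workflow averages a little over two tasks and roughly 157,000 spreadsheet cells, which gives a sense of how much data an agent may need to navigate per workflow.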

Abstract

We introduce FinWorkBench (a.k.a. Finch) for evaluating AI agents on real-world, enterprise-grade finance and accounting workflows that interleave data entry, structuring, formatting, web search, cross-file retrieval, calculation, modeling, validation, translation, visualization, and reporting. Finch is sourced from authentic enterprise workspaces from Enron (15,000 files and 500,000 emails) and other financial institutions, covering the period 2000–2025 and preserving the in-the-wild messiness of multimodal artifacts such as tables and charts across diverse domains including budgeting, trading, asset management, and operational management. We propose a workflow construction process that combines LLM-assisted mining of workflows from authentic enterprise environments with expert annotation: (1) LLM-assisted, expert-verified derivation of workflows from real-world email threads and…

Code & Models

Repositories

finworkbench/Finch
github

Datasets

FinWorkBench/Finch
dataset · 23k downloads