FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks

Michael Krumdick; Varshini Reddy; Shivani Chaudhary; William Day; Maarij Ahmed; Hayan Haqqi; Muhammad Ahsen Fahim; Hanzallah Amjad; Ahmad Orakzai; Aqsa Gul; Chris Tanner

arXiv:2604.05912·cs.CL·April 8, 2026


TL;DR

FrontierFinance is a long-horizon benchmark of 25 complex financial modeling tasks, developed with financial professionals, for evaluating how AI models perform on real-world financial modeling work.

Contribution

The paper provides the first industry-aligned, detailed benchmark for assessing AI performance on long-horizon financial tasks, including structured evaluation rubrics and human baseline comparisons.

Findings

01

Human experts outperform AI models on the benchmark.

02

Current state-of-the-art AI systems are less likely than human experts to produce client-ready outputs.

03

The benchmark reflects real-world financial workflows, and each task requires an average of over 18 hours of skilled human labor to complete.

Abstract

As concerns surrounding AI-driven labor displacement intensify in knowledge-intensive sectors, existing benchmarks fail to measure performance on tasks that define practical professional expertise. Finance, in particular, has been identified as a domain with high AI exposure risk, yet lacks robust benchmarks to track real-world developments. This gap is compounded by the absence of clear accountability mechanisms in current Large Language Model (LLM) deployments. To address this, we introduce FrontierFinance, a long-horizon benchmark of 25 complex financial modeling tasks across five core finance models, requiring an average of over 18 hours of skilled human labor per task to complete. Developed with financial professionals, the benchmark reflects industry-standard financial modeling workflows and is paired with detailed rubrics for structured evaluation. We engage human experts to…
