XFinBench: Benchmarking LLMs in Complex Financial Problem Solving and Reasoning

Zhihan Zhang; Yixin Cao; Lizi Liao

arXiv:2508.15861·cs.CL·August 25, 2025

TL;DR

XFinBench is a comprehensive benchmark with 4,235 examples designed to evaluate large language models' ability to solve complex, knowledge-intensive financial problems involving multimodal data, revealing current limitations and areas for improvement.

Contribution

The paper introduces XFinBench, a novel benchmark for assessing LLMs in complex financial reasoning, and provides extensive experimental analysis on 18 models highlighting their strengths and weaknesses.

Findings

01

o1 is the best-performing text-only model, with an overall accuracy of 67.3%.

02

Even the best text-only model, o1, lags behind human experts by 12.5% in accuracy, especially in temporal reasoning and scenario planning.

03

Knowledge augmentation improves small open-source models' performance.

Abstract

Solving financial problems demands complex reasoning, multimodal data processing, and a broad technical understanding, presenting unique challenges for current large language models (LLMs). We introduce XFinBench, a novel benchmark with 4,235 examples designed to evaluate LLMs' ability to solve complex, knowledge-intensive financial problems across diverse graduate-level finance topics with multimodal context. We identify five core capabilities of LLMs using XFinBench, i.e., terminology understanding, temporal reasoning, future forecasting, scenario planning, and numerical modelling. On XFinBench, we conduct extensive experiments on 18 leading models. The results show that o1 is the best-performing text-only model with an overall accuracy of 67.3%, but it still lags significantly behind human experts by 12.5%, especially in temporal reasoning and scenario planning capabilities. We…
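The abstract reports an overall accuracy alongside results for five capabilities. As a rough illustration of how such figures could be tallied, here is a minimal sketch. It assumes each evaluated example is a record with hypothetical `capability`, `prediction`, and `answer` fields, and that scoring is by exact match. Neither the field names nor the scoring rule comes from the paper, and the toy records are made up.

```python
from collections import defaultdict

# The five capabilities named in the abstract.
CAPABILITIES = (
    "terminology understanding",
    "temporal reasoning",
    "future forecasting",
    "scenario planning",
    "numerical modelling",
)


def accuracy_by_capability(records):
    """Return (overall, per_capability) accuracies as percentages.

    Each record is a dict with hypothetical keys 'capability',
    'prediction', and 'answer'. Exact-match scoring is an assumption,
    not the paper's documented metric.
    """
    if not records:
        raise ValueError("no records to score")
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        cap = rec["capability"]
        if cap not in CAPABILITIES:
            raise ValueError(f"unknown capability: {cap!r}")
        total[cap] += 1
        correct[cap] += int(rec["prediction"] == rec["answer"])
    per_capability = {cap: 100.0 * correct[cap] / total[cap] for cap in total}
    overall = 100.0 * sum(correct.values()) / sum(total.values())
    return overall, per_capability


# Made-up toy records for illustration only, not drawn from XFinBench.
toy_records = [
    {"capability": "temporal reasoning", "prediction": "A", "answer": "A"},
    {"capability": "temporal reasoning", "prediction": "B", "answer": "C"},
    {"capability": "scenario planning", "prediction": "True", "answer": "True"},
    {"capability": "numerical modelling", "prediction": "42", "answer": "42"},
]

overall, per_capability = accuracy_by_capability(toy_records)
```

Note that the overall figure here is computed over all examples rather than as a mean of the per-capability scores, so capabilities with more examples weigh more. Whether the paper aggregates this way is not stated in the abstract.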

Code & Models

Datasets

Zhihan/XFinBench
dataset · 135 downloads

Videos

XFinBench: Benchmarking LLMs in Complex Financial Problem Solving and Reasoning· underline