XFinBench: Benchmarking LLMs in Complex Financial Problem Solving and Reasoning
Zhihan Zhang, Yixin Cao, Lizi Liao

TL;DR
XFinBench is a comprehensive benchmark with 4,235 examples designed to evaluate large language models' ability to solve complex, knowledge-intensive financial problems involving multimodal data, revealing current limitations and areas for improvement.
Contribution
The paper introduces XFinBench, a novel benchmark for assessing LLMs in complex financial reasoning, and provides extensive experimental analysis on 18 models highlighting their strengths and weaknesses.
Findings
o1 is the best-performing text-only model, with an overall accuracy of 67.3%.
Even the best model, o1, lags behind human experts by 12.5% in accuracy, especially in temporal reasoning and scenario planning.
Knowledge augmentation improves small open-source models' performance.
Abstract
Solving financial problems demands complex reasoning, multimodal data processing, and a broad technical understanding, presenting unique challenges for current large language models (LLMs). We introduce XFinBench, a novel benchmark with 4,235 examples designed to evaluate LLMs' ability to solve complex, knowledge-intensive financial problems across diverse graduate-level finance topics with multimodal context. We identify five core capabilities of LLMs using XFinBench, i.e., terminology understanding, temporal reasoning, future forecasting, scenario planning, and numerical modelling. Using XFinBench, we conduct extensive experiments on 18 leading models. The results show that o1 is the best-performing text-only model with an overall accuracy of 67.3%, but it still lags significantly behind human experts by 12.5%, especially in temporal reasoning and scenario planning capabilities. We…
