FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning

Zhuohan Xie; Daniil Orel; Rushil Thareja; Dhruv Sahnan; Hachem Madmoun; Fan Zhang; Debopriyo Banerjee; Georgi Georgiev; Xueqing Peng; Lingfei Qian; Jimin Huang; Jinyan Su; Aaryamonvikram Singh; Rui Xing; Rania Elbadry; Chen Xu; Haonan Li; Fajri Koto; Ivan Koychev; Tanmoy Chakraborty; Yuxia Wang; Salem Lahlou; Veselin Stoyanov; Sophia Ananiadou; and Preslav Nakov

arXiv:2506.02515·cs.CL·May 1, 2026

TL;DR

FinChain is a new benchmark designed to evaluate verifiable multi-step symbolic reasoning in finance, highlighting current LLM limitations and aiding development of trustworthy financial AI.

Contribution

The paper introduces FinChain, a symbolic, verifiable benchmark for financial reasoning, and CHAINEVAL, a measure that jointly scores final-answer correctness and step-level reasoning consistency.

Findings

01. Frontier LLMs show limited symbolic financial reasoning.

02. Domain-adapted models improve reasoning performance.

03. FinChain exposes weaknesses in multi-step financial reasoning.

Abstract

Multi-step symbolic reasoning is essential for robust financial analysis, yet current benchmarks largely overlook this capability. Existing datasets such as FinQA and ConvFinQA emphasize final numerical answers while neglecting the intermediate reasoning steps required for transparency and verification. To address this gap, we introduce FinChain, the first benchmark specifically designed for verifiable Chain-of-Thought evaluation in finance. FinChain spans 58 topics across 12 financial domains, each represented by parameterized symbolic templates with executable Python code that enable fully machine-verifiable reasoning and scalable, contamination-free data generation. To assess reasoning capacity, we propose CHAINEVAL, a dynamic alignment measure that jointly evaluates final-answer correctness and step-level reasoning consistency. Our evaluation of 26 leading LLMs reveals…
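To make the idea of a parameterized symbolic template concrete, here is a minimal sketch in Python. It is not taken from the FinChain repository: the topic (compound interest), the function name `compound_interest_template`, the `Instance` container, and the parameter ranges are all hypothetical choices made for illustration. It shows how one template can sample its parameters from a seed, render a question, and emit intermediate steps whose values are computed by the same code and can therefore be checked mechanically.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """One generated problem (hypothetical container, not FinChain's schema)."""
    question: str
    steps: list      # list of (description, value) pairs, in order
    answer: float


def compound_interest_template(seed: int) -> Instance:
    """Sample parameters from a seed and build a question with a verifiable step chain."""
    rng = random.Random(seed)                      # seeded, so each instance is reproducible
    principal = rng.randrange(1_000, 10_001, 500)  # deposit in dollars (assumed range)
    rate_pct = rng.choice([2, 3, 4, 5, 6])         # annual rate in percent (assumed values)
    years = rng.randint(2, 10)                     # holding period (assumed range)

    # The symbolic solution: every intermediate quantity is computed, not hand-written.
    growth_factor = (1 + rate_pct / 100) ** years
    final_value = principal * growth_factor
    interest = final_value - principal

    question = (
        f"An investor deposits ${principal} at {rate_pct}% annual interest, "
        f"compounded yearly, for {years} years. How much interest is earned?"
    )
    steps = [
        ("growth factor = (1 + r)^n", round(growth_factor, 6)),
        ("final value = P * growth factor", round(final_value, 2)),
        ("interest = final value - P", round(interest, 2)),
    ]
    return Instance(question, steps, round(interest, 2))


if __name__ == "__main__":
    inst = compound_interest_template(seed=7)
    print(inst.question)
    for description, value in inst.steps:
        print(f"  {description}: {value}")
    print("answer:", inst.answer)
```

Because the parameters are drawn from a seed, changing the seed yields a fresh instance with the same reasoning structure, which is the property the abstract describes as scalable, contamination-free generation. A model's step-by-step output could then be compared against `steps` as well as `answer`; how CHAINEVAL performs that comparison is not specified in this summary.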

Code & Models

Repositories

mbzuai-nlp/finchain (GitHub)
