FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models

Dmitry Stanishevskii; Nini Kamkia; Alexey Khoroshilov; Dmitry Zmitrovich; Denis Kokosinskii; Zhirayr Hayrapetyan; Andrei Kalmykov

arXiv:2605.15482 · cs.CL · May 18, 2026

TL;DR

FINESSE-Bench is a comprehensive hierarchical benchmark suite designed to evaluate large language models' financial domain knowledge and technical reasoning across various professional levels and tasks.

Contribution

The paper introduces a suite of eight specialized benchmarks totaling 3,993 questions, enabling detailed assessment of financial expertise and reasoning in LLMs.

Findings

01. Enables evaluation of domain breadth and of how performance degrades as question difficulty increases.

02. Assesses computational skills and model behavior in financial contexts.

03. Provides an automated scoring scheme for diverse answer formats.
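
To make the idea of automated scoring across answer formats concrete, here is a minimal sketch in Python. It is not the scoring scheme used by FINESSE-Bench, whose details are not given on this page. The three answer types (`choice`, `numeric`, `text`), the function names, and the 1% relative tolerance are all assumptions chosen for illustration.

```python
import math
import re
from typing import Optional


def normalize_choice(text: str) -> str:
    """Return the first standalone option letter (A-E) in a reply, or ''."""
    match = re.search(r"\b([A-E])\b", text.strip().upper())
    return match.group(1) if match else ""


def parse_number(text: str) -> Optional[float]:
    """Return the first number in a reply, ignoring thousands separators."""
    cleaned = text.replace(",", "")
    match = re.search(r"-?\d+(?:\.\d+)?", cleaned)
    return float(match.group()) if match else None


def score(answer_type: str, prediction: str, reference: str,
          rel_tol: float = 0.01) -> bool:
    """Score one prediction against its reference (hypothetical scheme).

    answer_type is an assumed label: 'choice', 'numeric', or 'text'.
    """
    if answer_type == "choice":
        return normalize_choice(prediction) == reference.strip().upper()
    if answer_type == "numeric":
        value = parse_number(prediction)
        target = parse_number(reference)
        if value is None or target is None:
            return False
        # Accept answers within a relative tolerance of the reference.
        return math.isclose(value, target, rel_tol=rel_tol)
    if answer_type == "text":
        # Case-insensitive exact match after collapsing whitespace.
        squash = lambda s: " ".join(s.lower().split())
        return squash(prediction) == squash(reference)
    raise ValueError(f"unknown answer type: {answer_type}")
```

For example, `score("numeric", "$1,234.50", "1234.5")` accepts the answer despite the currency sign and thousands separator. The letter extraction is deliberately naive and would misread a reply that begins with the article "A", so a real scorer would need a stricter output format or a more careful parser.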

Abstract

Large language models (LLMs) are increasingly being applied to financial analysis, reporting, investment decision support, risk management, compliance, and professional training. However, robust evaluation of their domain competence in finance remains incomplete. Widely used open benchmarks such as FinQA, ConvFinQA, and TAT-QA have played an important role in advancing financial question answering and numerical reasoning, but they focus primarily on question answering over financial reports and do not provide an explicit hierarchy of professional difficulty. Broader resources, including FinanceBench, PIXIU, FinBen, and FLaME, expand the coverage of financial tasks, yet the problem of evaluating the transition from foundational knowledge to expert-level financial reasoning remains open. In this work, we present FINESSE-Bench, a suite of eight specialized benchmarks comprising 3,993…

Code & Models

Repositories

limexailab/FINESSE-Bench (GitHub)
