TL;DR
FINESSE-Bench is a comprehensive hierarchical benchmark suite designed to evaluate large language models' financial domain knowledge and technical reasoning across various professional levels and tasks.
Contribution
It introduces a new suite of eight specialized benchmarks with 3,993 questions, enabling detailed assessment of financial expertise and reasoning in LLMs.
Findings
Enables evaluation of domain breadth and of how model performance degrades as question difficulty increases.
Assesses computational skills and model behavior in financial contexts.
Provides an automated scoring scheme for diverse answer formats.
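The summary does not spell out how the automated scoring works, so the following is only an illustrative sketch of how a scorer for mixed answer formats might look. The two answer kinds ("choice" and "numeric"), the 1% relative tolerance, and the names `score_answer` and `accuracy` are all assumptions made for this example and are not taken from FINESSE-Bench.

```python
import math


def score_answer(predicted: str, gold: str, kind: str, rel_tol: float = 0.01) -> bool:
    """Return True if `predicted` matches `gold` under the rule for `kind`.

    Hypothetical helper: the answer kinds and the tolerance are assumptions,
    not the benchmark's actual scoring rules.
    """
    if kind == "choice":
        # Multiple-choice: compare option letters, ignoring case and whitespace.
        return predicted.strip().upper() == gold.strip().upper()
    if kind == "numeric":
        # Numeric: strip thousands separators and percent signs, then compare
        # within a relative tolerance.
        try:
            p = float(predicted.replace(",", "").replace("%", "").strip())
            g = float(gold.replace(",", "").replace("%", "").strip())
        except ValueError:
            return False  # unparseable output counts as incorrect
        return math.isclose(p, g, rel_tol=rel_tol)
    raise ValueError(f"unknown answer kind: {kind!r}")


def accuracy(items: list[tuple[str, str, str]]) -> float:
    """Fraction of (predicted, gold, kind) triples scored as correct."""
    return sum(score_answer(p, g, k) for p, g, k in items) / len(items)
```

Dispatching on an explicit answer kind keeps each comparison rule small and lets unparseable model output fall through to "incorrect" rather than raising an error.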
Abstract
Large language models (LLMs) are increasingly being applied to financial analysis, reporting, investment decision support, risk management, compliance, and professional training. However, robust evaluation of their domain competence in finance remains incomplete. Widely used open benchmarks such as FinQA, ConvFinQA, and TAT-QA have played an important role in advancing financial question answering and numerical reasoning, but they focus primarily on question answering over financial reports and do not provide an explicit hierarchy of professional difficulty. Broader resources, including FinanceBench, PIXIU, FinBen, and FLaME, expand the coverage of financial tasks, yet the problem of evaluating the transition from foundational knowledge to expert-level financial reasoning remains open. In this work, we present FINESSE-Bench, a suite of eight specialized benchmarks comprising 3,993…
