BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs
Guilong Lu, Xuntao Guo, Rongjunchen Zhang, Wenqiao Zhu, Ji Liu

TL;DR
BizFinBench is a comprehensive, business-oriented benchmark designed to evaluate large language models in real-world financial tasks, highlighting their strengths and weaknesses across multiple financial reasoning and information processing dimensions.
Contribution
This paper introduces BizFinBench, the first benchmark designed specifically to assess LLMs in real-world financial applications, along with IteraJudge, a novel evaluation method that reduces bias when LLMs serve as evaluators.
Findings
No model dominates all tasks.
Proprietary models excel in reasoning; smaller models lag.
Performance varies significantly across tasks.
Abstract
Large language models excel in general tasks, yet assessing their reliability in logic-heavy, precision-critical domains like finance, law, and healthcare remains challenging. To address this, we introduce BizFinBench, the first benchmark specifically designed to evaluate LLMs in real-world financial applications. BizFinBench consists of 6,781 well-annotated queries in Chinese, spanning five dimensions: numerical calculation, reasoning, information extraction, prediction recognition, and knowledge-based question answering, grouped into nine fine-grained categories. The benchmark includes both objective and subjective metrics. We also introduce IteraJudge, a novel LLM evaluation method that reduces bias when LLMs serve as evaluators in objective metrics. We benchmark 25 models, including both proprietary and open-source systems. Extensive experiments show that no model dominates across…
Peer Reviews
Decision·Submitted to ICLR 2026
1. The authors provide a comprehensive set of 7,605 annotated instances across 5 major financial dimensions and 9 granular categories, ensuring broad domain and task coverage. 2. The authors conducted extensive experiments, evaluating 30 closed- and open-source models in total. 3. The authors proposed a new evaluation method, IteraJudge, to reduce bias during evaluation.
1. Existing works [1, 2] should be discussed in the related work section. 2. Task difficulty: Overall, the goal of a benchmark is to facilitate the development and improvement of future models. However, in its current form, this benchmark is not sufficiently challenging. Current models perform very well on 5 out of 9 tasks (the SOTA model achieves over 90 on 4 tasks and 87 on another), raising major concerns about the dataset’s usefulness. 3. It is not clear what current models failed; any error
(1) The paper presents BizFinBench, a Chinese financial benchmark consisting of original prompts from actual users. This differs from the vast majority of synthetic datasets that have flooded the field recently. (2) The paper uses LLM judges for evaluation, a choice backed by a comparison of different LLM judges. However, this would have been stronger with a comparison against a human baseline.
(1) Some subsets of the dataset are already highly saturated, with top-performing models scoring over 90. (2) The contribution (dataset) is limited to a small subgroup of NLP: the Chinese community interested in finance. (3) The paper would be better with an error analysis that compares closed and open models. This would give better guidance on what model makers need to improve. (4) The evaluation setup says the maximum token count is limited to 1024 tokens; I'm not sure if this enou
1. The dataset is sourced directly from trading platforms with large user bases, which ensures its grounding in real-world financial business scenarios. 2. The paper proposes IteraJudge, an LLM-as-a-Judge evaluation method capable of providing reliable and multi-dimensional assessments for complex problems.
1. While the paper claims that BizFinBench features contextual complexity and adversarial robustness, it does not provide any statistical metrics to quantify such complexity, nor does it include examples illustrating how noise or adversarial elements are introduced into the data. 2. The presentation of related work contains notable omissions in both form and content: (1) In Table 1, the "task" column omits Financial numerical reasoning, despite this being explicitly included in the paper's
