BizFinBench.v2: A Unified Dual-Mode Bilingual Benchmark for Expert-Level Financial Capability Alignment

Xin Guo; Rongjunchen Zhang; Guilong Lu; Xuntao Guo; Shuai Jia; Zhi Yang; Liwen Zhang

arXiv:2601.06401 · cs.AI · January 13, 2026

TL;DR

BizFinBench.v2 is a comprehensive, real-world benchmark for evaluating large language models' financial capabilities using authentic Chinese and U.S. market data, addressing previous limitations of simulated and static assessments.

Contribution

Introduces BizFinBench.v2, the first authentic, dual-mode bilingual benchmark with online assessment for expert-level financial tasks, covering 29,578 Q&A pairs across core business scenarios.

Findings

01. ChatGPT-5 achieves 61.5% accuracy on main tasks.

02. DeepSeek-R1 outperforms other commercial LLMs in online tasks.

03. Error analysis reveals specific capability gaps in current models.

Abstract

Large language models have undergone rapid evolution, emerging as a pivotal technology for intelligence in financial operations. However, existing benchmarks are often constrained by pitfalls such as reliance on simulated or general-purpose samples and a focus on singular, offline static scenarios. Consequently, they fail to align with the requirements for authenticity and real-time responsiveness in financial services, leading to a significant discrepancy between benchmark performance and actual operational efficacy. To address this, we introduce BizFinBench.v2, the first large-scale evaluation benchmark grounded in authentic business data from both Chinese and U.S. equity markets, integrating online assessment. We performed clustering analysis on authentic user queries from financial platforms, resulting in eight fundamental tasks and two online tasks across four core business…

Code & Models

Datasets

HiThink-Research/BizFinBench.v2
dataset · 228 dl

Taxonomy

Topics: Stock Market Forecasting Methods · Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI)