Evaluation and Benchmarking Suite for Financial Large Language Models and Agents

Shengyuan Lin; Kaiwen He; Jaisal Patel; Qinchuan Zhang; Chris Ding; James Tang; Keyi Wang; Yupeng Cao; Yan Wang; Kairong Xiao; Vincent Caldeira; Matt White; Xiao-Yang Liu Yanglet

arXiv:2602.19073 · cs.CE · February 24, 2026

Open Access

TL;DR

This paper introduces a comprehensive evaluation and benchmarking suite for financial large language models and agents, aiming to improve their reliability, governance, and application in the financial industry.

Contribution

It presents an open platform with evaluation tools, governance frameworks, and leaderboards specifically designed for FinLLMs and FinAgents, advancing financial AI research and deployment.

Findings

01. Development of an evaluation pipeline and governance framework

02. Launch of a FinLLM Leaderboard with HuggingFace

03. Facilitation of quantitative and qualitative analysis of FinLLMs and FinAgents

Abstract

Over the past three years, the financial services industry has witnessed Large Language Models (LLMs) and agents transitioning from the exploration stage to readiness and governance stages. Financial large language models (FinLLMs), such as the open FinGPT and the proprietary BloombergGPT, have great potential in financial applications, including retrieving real-time data, tutoring, analyzing social media sentiment, analyzing SEC filings, and agentic trading. However, general-purpose LLMs and agents lack financial expertise and often struggle to handle complex financial reasoning. This paper presents an evaluation and benchmarking suite that covers the lifecycle of FinLLMs and FinAgents. This suite, led by SecureFinAI Lab, includes an evaluation pipeline and a governance framework developed in collaboration with the Linux Foundation and the PyTorch Foundation, a FinLLM Leaderboard with HuggingFace, an AgentOps…

Taxonomy

Topics: Stock Market Forecasting Methods · FinTech, Crowdfunding, Digital Finance · Financial Reporting and XBRL