Evaluation and Benchmarking Suite for Financial Large Language Models and Agents
Shengyuan Lin, Kaiwen He, Jaisal Patel, Qinchuan Zhang, Chris Ding, James Tang, Keyi Wang, Yupeng Cao, Yan Wang, Kairong Xiao, Vincent Caldeira, Matt White, Xiao-Yang Liu Yanglet

TL;DR
This paper introduces a comprehensive evaluation and benchmarking suite for financial large language models and agents, aiming to improve their reliability, governance, and application in the financial industry.
Contribution
It presents an open platform with evaluation tools, governance frameworks, and leaderboards specifically designed for FinLLMs and FinAgents, advancing financial AI research and deployment.
Findings
Development of an evaluation pipeline and governance framework
Launch of a FinLLM Leaderboard with HuggingFace
Facilitation of quantitative and qualitative analysis of FinLLMs and FinAgents
Abstract
Over the past three years, the financial services industry has watched large language models (LLMs) and agents move from the exploration stage to the readiness and governance stages. Financial large language models (FinLLMs), such as the open FinGPT and the proprietary BloombergGPT, have great potential in financial applications. These include retrieving real-time data, tutoring, analyzing social media sentiment, analyzing SEC filings, and agentic trading. However, general-purpose LLMs and agents lack financial expertise and often struggle with complex financial reasoning. This paper presents an evaluation and benchmarking suite that covers the lifecycle of FinLLMs and FinAgents. Led by SecureFinAI Lab, the suite includes an evaluation pipeline and a governance framework developed in collaboration with the Linux Foundation and the PyTorch Foundation, a FinLLM Leaderboard with HuggingFace, an AgentOps…
Taxonomy
Topics: Stock Market Forecasting Methods · FinTech, Crowdfunding, Digital Finance · Financial Reporting and XBRL
