From Scores to Skills: A Cognitive Diagnosis Framework for Evaluating Financial Large Language Models
Ziyan Kuang, Feiyu Zhu, Maowei Jiang, Yanzhao Lai, Zelin Wang, Zhitong Wang, Meikang Qiu, Jiajia Huang, Min Peng, Qianqian Xie, Sophia Ananiadou

TL;DR
This paper introduces FinCDM, a novel cognitive diagnosis framework for evaluating financial LLMs at the knowledge-skill level, revealing hidden gaps and behavioral patterns overlooked by traditional benchmarks.
Contribution
The paper presents FinCDM, the first cognitive diagnosis evaluation framework for financial LLMs, along with a comprehensive, expert-annotated financial skills dataset derived from CPA exams.
Findings
FinCDM uncovers hidden knowledge gaps in financial LLMs.
It identifies under-tested areas like tax and regulatory reasoning.
The framework reveals behavioral clusters among models.
Abstract
Large Language Models (LLMs) have shown promise for financial applications, yet their suitability for this high-stakes domain remains largely unproven due to inadequacies in existing benchmarks. These benchmarks rely solely on score-level evaluation, summarizing performance with a single score that obscures what models truly know and where their precise limitations lie. They also rely on datasets that cover only a narrow subset of financial concepts, while overlooking other essentials for real-world applications. To address these gaps, we introduce FinCDM, the first cognitive diagnosis evaluation framework tailored for financial LLMs. It enables evaluation of LLMs at the knowledge-skill level, identifying which financial skills and knowledge they have or lack based on their response patterns across skill-tagged tasks, rather than reporting a single aggregated number. We…
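To illustrate the contrast the abstract draws between a single aggregated score and a skill-level view, here is a minimal sketch in Python. It is not FinCDM's method: it only tallies raw per-skill accuracy from skill-tagged items, whereas a cognitive diagnosis model would typically estimate latent mastery. The function name `skill_profile`, the item IDs, and the skill labels are all hypothetical.

```python
from collections import defaultdict


def skill_profile(responses, item_skills):
    """Return per-skill accuracy from binary responses and item skill tags.

    responses:   dict mapping item id -> True/False (answered correctly?)
    item_skills: dict mapping item id -> list of skill labels for that item
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item_id, is_correct in responses.items():
        # An item tagged with several skills counts toward each of them.
        for skill in item_skills[item_id]:
            total[skill] += 1
            correct[skill] += int(is_correct)
    return {skill: correct[skill] / total[skill] for skill in total}


# Hypothetical skill-tagged items and one model's responses (illustrative only).
item_skills = {
    "q1": ["tax"],
    "q2": ["tax", "audit"],
    "q3": ["audit"],
    "q4": ["valuation"],
}
responses = {"q1": False, "q2": False, "q3": True, "q4": True}

overall = sum(responses.values()) / len(responses)  # single aggregated score
profile = skill_profile(responses, item_skills)     # skill-level breakdown

print(f"overall accuracy: {overall:.2f}")
for skill, accuracy in sorted(profile.items()):
    print(f"{skill}: {accuracy:.2f}")
```

In this toy data the overall accuracy of 0.50 hides the fact that every tax-tagged item was answered incorrectly, which is the kind of gap a skill-level evaluation is meant to surface.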
Taxonomy
Topics: Stock Market Forecasting Methods
