Evaluating Large Language Models (LLMs) in Financial NLP: A Comparative Study on Financial Report Analysis
Md Talha Mohsin

TL;DR
This study systematically evaluates five transformer-based large language models on financial report analysis. It finds substantial variability in their performance, behavior, and reliability, which underscores the need for comprehensive evaluation frameworks in high-stakes financial NLP tasks.
Contribution
The paper provides a controlled, multi-faceted evaluation of LLMs in financial NLP, highlighting behavioral differences between models and the importance of nuanced assessment methods.
Findings
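In human evaluation, models differ in average relevance, completeness, clarity, conciseness, and factual accuracy, though inter-rater agreement is modest.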
Automated metrics show systematic lexical and semantic differences.
Response stability varies across models and prompts.
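As a rough illustration of how response stability could be quantified, the sketch below scores a set of repeated answers to the same prompt by their mean pairwise token overlap (Jaccard similarity). This is an assumption made for illustration only: the summary does not say which stability measure the paper uses, and the function names `tokens`, `jaccard`, and `stability` are hypothetical.

```python
import re
from itertools import combinations


def tokens(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-level Jaccard similarity between two strings, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty responses are treated as identical
    return len(ta & tb) / len(ta | tb)


def stability(responses):
    """Mean pairwise Jaccard similarity over repeated responses.

    Illustrative stand-in for a stability diagnostic, not the paper's metric.
    """
    pairs = list(combinations(responses, 2))
    if not pairs:
        raise ValueError("need at least two responses")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A score near 1 means the repeated answers use almost the same vocabulary; a lower score means the wording drifts between runs. Purely lexical overlap ignores paraphrase, so a semantic similarity measure would be a natural complement.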
Abstract
Large language models (LLMs) are increasingly used to support the analysis of complex financial disclosures, yet their reliability, behavioral consistency, and transparency remain insufficiently understood in high-stakes settings. This paper presents a controlled evaluation of five transformer-based LLMs applied to question answering over the Business sections of U.S. 10-K filings. To capture complementary aspects of model behavior, we combine human evaluation, automated similarity metrics, and behavioral diagnostics under standardized and context-controlled prompting conditions. Human assessments indicate that models differ in their average performance across qualitative dimensions such as relevance, completeness, clarity, conciseness, and factual accuracy, though inter-rater agreement is modest, reflecting the subjective nature of these criteria. Automated metrics reveal systematic…
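To make the notion of inter-rater agreement concrete, here is a minimal sketch of Cohen's kappa for two raters assigning categorical scores to the same answers. The abstract does not state which agreement statistic the paper reports, so the choice of Cohen's kappa and the function name `cohens_kappa` are assumptions for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    Assumed statistic for illustration; the paper may use a different one.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

For example, `cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])` gives 0.5: the raters agree on three of four items (0.75), but half of that agreement is expected by chance given their label frequencies.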
Taxonomy
Topics: Stock Market Forecasting Methods
