Assessing Consistency and Reproducibility in the Outputs of Large Language Models: Evidence Across Diverse Finance and Accounting Tasks

Julian Junyan Wang; Victor Xiaoqi Wang

arXiv:2503.16974 · q-fin.GN · September 16, 2025 · 2 cites

Open Access

TL;DR

This paper systematically evaluates the consistency and reproducibility of large language models in finance and accounting tasks, revealing task-dependent variability and the effectiveness of aggregation strategies, with implications for research reliability.

Contribution

The paper provides the first comprehensive analysis of LLM output consistency across multiple finance and accounting tasks and three OpenAI models, identifying task-dependent patterns and mitigation strategies such as aggregating repeated runs.

Findings

01. Binary classification and sentiment analysis show high reproducibility.

02. Aggregation of multiple runs improves consistency and accuracy.

03. Downstream inferences remain robust despite output variability.
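
The aggregation idea in the second finding can be illustrated with a small sketch. This summary does not say which aggregation rule the authors use, so the majority vote below is an assumption, and the function name and sample labels are hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label among repeated runs on one input.

    Majority voting is an assumed aggregation rule for illustration;
    it is not confirmed as the paper's method.
    """
    if not labels:
        raise ValueError("labels must be non-empty")
    return Counter(labels).most_common(1)[0][0]


# Hypothetical sentiment labels from five repeated runs on the same text.
runs = ["positive", "positive", "neutral", "positive", "negative"]
aggregated = majority_vote(runs)
print(aggregated)
```

A single run here could have returned any of three labels, while the aggregated label depends only on the most common outcome, which is why pooling runs tends to stabilize results.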

Abstract

This study provides the first comprehensive assessment of consistency and reproducibility in Large Language Model (LLM) outputs in finance and accounting research. We evaluate how consistently LLMs produce outputs given identical inputs through extensive experimentation with 50 independent runs across five common tasks: classification, sentiment analysis, summarization, text generation, and prediction. Using three OpenAI models (GPT-3.5-turbo, GPT-4o-mini, and GPT-4o), we generate over 3.4 million outputs from diverse financial source texts and data, covering MD&As, FOMC statements, finance news articles, earnings call transcripts, and financial statements. Our findings reveal substantial but task-dependent consistency, with binary classification and sentiment analysis achieving near-perfect reproducibility, while complex tasks show greater variability. More advanced models do not…
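
One simple way to quantify how consistently repeated runs agree on identical inputs is the share of runs that match the most common output for each item. The sketch below is a minimal illustration under that assumption; the metric, the function names, and the sample data are hypothetical and may differ from the measures used in the paper.

```python
from collections import Counter


def modal_agreement(outputs):
    """Fraction of runs that match the most common output for one item."""
    if not outputs:
        raise ValueError("outputs must be non-empty")
    counts = Counter(outputs)
    return max(counts.values()) / len(outputs)


def mean_agreement(runs_by_item):
    """Average modal agreement across all items."""
    scores = [modal_agreement(outputs) for outputs in runs_by_item.values()]
    return sum(scores) / len(scores)


# Hypothetical data: four repeated runs per document (the paper uses 50).
runs_by_item = {
    "doc_1": ["yes", "yes", "yes", "yes"],  # fully consistent
    "doc_2": ["yes", "no", "yes", "no"],    # evenly split
}
score = mean_agreement(runs_by_item)
print(score)
```

A score of 1.0 means every run produced the same output for every item. With binary outputs, the per-item score cannot fall below 0.5.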

Taxonomy

Topics: Stock Market Forecasting Methods · Financial Reporting and XBRL · Auditing, Earnings Management, Governance