Evaluating Accounting Reasoning Capabilities of Large Language Models
Jie Zhou, Xin Chen, Jie Zhang, Hai Li, Jie Wang, Zhe Li

TL;DR
This paper assesses the accounting reasoning abilities of large language models and proposes evaluation criteria and benchmarks. GPT-4 performs best among the models tested, but current models still need improvement before real-world enterprise use.
Contribution
The paper introduces a systematic framework and benchmarks for evaluating accounting reasoning in large language models, intended to guide future improvements.
Findings
GPT-4 shows the strongest accounting reasoning performance among the evaluated models (GLM-6B, GLM-130B, GLM-4, and GPT-4).
Prompt design significantly impacts model performance.
Current models are inadequate for real-world enterprise accounting.
Abstract
Large language models are transforming learning, cognition, and research across many fields. Effectively integrating them into professional domains, such as accounting, is a key challenge for enterprise digital transformation. To address this, we define vertical domain accounting reasoning and propose evaluation criteria derived from an analysis of the training data characteristics of representative GLM models. These criteria support systematic study of accounting reasoning and provide benchmarks for performance improvement. Using this framework, we evaluate GLM-6B, GLM-130B, GLM-4, and OpenAI GPT-4 on accounting reasoning tasks. Results show that prompt design significantly affects performance, with GPT-4 demonstrating the strongest capability. Despite these gains, current models remain insufficient for real-world enterprise accounting, indicating the need for further optimization to…
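As a rough illustration of what comparing prompt designs on accounting reasoning items could look like, here is a minimal sketch. It is not the paper's benchmark or method: the items, the prompt templates, the exact-match scoring, and the names `ITEMS`, `PROMPTS`, `accuracy_by_prompt`, and `stub_model` are all assumptions made for illustration, and `stub_model` is a toy stand-in for a real model call.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy items (question, expected answer); not taken from the paper.
ITEMS: List[Tuple[str, str]] = [
    ("Assets are 500 and liabilities are 200. What is equity?", "300"),
    ("Revenue is 900 and expenses are 650. What is net income?", "250"),
]

# Hypothetical prompt templates; the paper's actual prompts are not given here.
PROMPTS: Dict[str, str] = {
    "zero_shot": "{question}",
    "with_rule": (
        "Use the accounting equation where relevant. "
        "{question} Answer with a number only."
    ),
}


def accuracy_by_prompt(model: Callable[[str], str]) -> Dict[str, float]:
    """Return exact-match accuracy of `model` for each prompt template."""
    scores: Dict[str, float] = {}
    for name, template in PROMPTS.items():
        correct = 0
        for question, expected in ITEMS:
            answer = model(template.format(question=question)).strip()
            correct += int(answer == expected)
        scores[name] = correct / len(ITEMS)
    return scores


def stub_model(prompt: str) -> str:
    """Toy stand-in for an LLM call (assumption, not a real model).

    It only answers when the prompt asks for a number, and then subtracts
    the second number in the prompt from the first.
    """
    if "number only" not in prompt:
        return "I am not sure."
    nums = [
        int(tok.rstrip("."))
        for tok in prompt.split()
        if tok.rstrip(".").isdigit()
    ]
    return str(nums[0] - nums[1])


if __name__ == "__main__":
    print(accuracy_by_prompt(stub_model))
```

The stub's behavior is contrived so that the two templates score differently; it says nothing about how any real model behaves. To evaluate an actual model, `stub_model` would be replaced by a function that sends the prompt to that model and returns its text response.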
Taxonomy
Topics: Accounting Education and Careers · Auditing, Earnings Management, Governance · Financial Reporting and XBRL
