Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications

Fadel M. Megahed; Ying-Ju Chen; L. Allison Jones-Farmer; Younghwa Lee; Jiawei Brooke Wang; Inez M. Zwetsloot

arXiv:2505.14918 · cs.CL · December 23, 2025

Open Access

TL;DR

This paper presents a framework for evaluating the consistency and reliability of large language models in binary text classification, with a case study on financial news sentiment classification across 14 LLMs.

Contribution

It introduces a psychometrically inspired framework for assessing LLM reliability, including sample size determination and response validity metrics, demonstrated through a comprehensive case study.

Findings

01. Models showed high intra-rater consistency, reaching perfect agreement across five replicates on 90-98% of examples.

02. Smaller models outperformed larger ones in sentiment classification accuracy when validated against StockNewsAPI labels.

03. Models performed at chance level in predicting actual market movements.

Abstract

This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification, addressing the lack of established reliability assessment methods. Adapting psychometric principles, we determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability. Our case study examines financial news sentiment classification across 14 LLMs (including claude-3-7-sonnet, gpt-4o, deepseek-r1, gemma3, llama3.2, phi4, and command-r-plus), with five replicates per model on 1,350 articles. Models demonstrated high intra-rater consistency, achieving perfect agreement on 90-98% of examples, with minimal differences between expensive and economical models from the same families. When validated against StockNewsAPI labels, models achieved strong performance (accuracy 0.76-0.88), with smaller models like gemma3:1B,…
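The two consistency quantities named in the abstract (the share of invalid responses, and the share of examples on which all replicates agree perfectly) can be illustrated with a minimal sketch. This is not the authors' implementation: the label set `VALID_LABELS`, the use of `None` for an invalid response, and the function names `invalid_rate` and `perfect_agreement_rate` are all assumptions made for illustration.

```python
from typing import Optional, Sequence

# Assumed binary label set; the paper's actual label names may differ.
VALID_LABELS = {"positive", "negative"}

Replicates = Sequence[Sequence[Optional[str]]]  # one inner sequence per article


def invalid_rate(replicates: Replicates) -> float:
    """Fraction of all responses that are not one of the valid labels."""
    total = sum(len(row) for row in replicates)
    invalid = sum(1 for row in replicates for r in row if r not in VALID_LABELS)
    return invalid / total if total else 0.0


def perfect_agreement_rate(replicates: Replicates) -> float:
    """Fraction of examples whose replicates are all valid and identical."""
    if not replicates:
        return 0.0
    agree = sum(
        1
        for row in replicates
        if row and row[0] in VALID_LABELS and len(set(row)) == 1
    )
    return agree / len(replicates)


if __name__ == "__main__":
    # Toy data: four articles, five replicates each (hypothetical values).
    responses = [
        ["positive"] * 5,
        ["negative"] * 5,
        ["positive", "positive", "negative", "positive", "positive"],
        ["positive", "positive", None, "positive", "positive"],
    ]
    print(perfect_agreement_rate(responses))
    print(invalid_rate(responses))
```

In this sketch an example with any invalid replicate counts as a disagreement; a study could instead drop invalid responses before comparing, which would raise the agreement rate.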

Taxonomy

Topics: Text and Document Classification Technologies