The Impact of LLM Self-Consistency and Reasoning Effort on Automated Scoring Accuracy and Cost

Scott Frohn

arXiv:2604.26954·cs.CY·May 1, 2026

TL;DR

This study investigates how self-consistency and reasoning effort affect the accuracy and cost of automated scoring with large language models, and identifies the configurations that offer the best trade-off between the two.

Contribution

The paper shows that strategic model selection and reasoning settings are more effective than ensembling, and it offers guidance on cost-effective, high-accuracy LLM scoring.

Findings

01

Temperature sampling significantly improves accuracy over deterministic calls.

02

Increasing ensemble size (j = 1 to 7) produces no significant accuracy gain.

03

Higher reasoning effort shows a significant positive linear trend with scoring accuracy, though the benefit varies by model family.
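
Self-consistency, as used in the findings above, means sampling the same model several times and taking the majority score. The following is a minimal Python sketch of that voting step, not the paper's implementation. The function name `majority_vote` and the tie-breaking rule (prefer the lowest score) are assumptions made for illustration; the source does not say how ties are resolved.

```python
from collections import Counter
from typing import Sequence


def majority_vote(scores: Sequence[int]) -> int:
    """Return the most frequent score among j sampled scores.

    Assumption (not from the paper): ties are broken by choosing
    the lowest tied score, so the result is deterministic.
    """
    if not scores:
        raise ValueError("need at least one sampled score")
    counts = Counter(scores)
    top_count = max(counts.values())
    # Collect every score that reached the top count, then pick the lowest.
    return min(s for s, c in counts.items() if c == top_count)


# Hypothetical example: seven sampled scores for one student conversation.
sampled = [1, 0, 1, 1, 0, 1, 1]
final_score = majority_vote(sampled)
```

With an ensemble size of one, the vote simply returns the single sampled score, which is the baseline the larger ensembles are compared against.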

Abstract

Strategic model selection and reasoning settings are more effective than ensembling for optimizing automated scoring with large language models (LLMs). We examined self-consistency (intra-model majority voting) and reasoning effort for scoring conversation-based assessment items in high school mathematics, evaluating 900 student conversations against human-scored ground truths using frontier and low-cost models from OpenAI and Google. Temperature sampling significantly improved accuracy over deterministic calls, but increasing ensemble size (j = 1 to 7) produced no significant gains. Higher reasoning effort showed a significant positive linear trend with scoring accuracy, though the benefit varied by model family. An efficiency frontier analysis identified Gemini 3.1 Pro Preview at low reasoning as the most accurate but costly configuration; GPT-5.4 Nano and Mini with no reasoning…
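
An efficiency frontier of the kind mentioned in the abstract can be read as the set of configurations that no other configuration beats on both cost and accuracy. The sketch below shows one common way to compute such a frontier, assuming each configuration is summarized by a single cost and a single accuracy value. The `Config` type, the function name, and the example numbers are hypothetical placeholders, not data or methods from the paper.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    name: str        # hypothetical label for a model + reasoning setting
    cost: float      # cost per scored item (arbitrary units)
    accuracy: float  # agreement with human scores, between 0 and 1


def efficiency_frontier(configs: List[Config]) -> List[Config]:
    """Return the configurations not dominated on (lower cost, higher accuracy)."""
    frontier: List[Config] = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; among equal costs, try the
    # most accurate first. Keep a config only if it beats every cheaper one.
    for cfg in sorted(configs, key=lambda c: (c.cost, -c.accuracy)):
        if cfg.accuracy > best_accuracy:
            frontier.append(cfg)
            best_accuracy = cfg.accuracy
    return frontier


# Made-up numbers for illustration only.
candidates = [
    Config("A", cost=1.0, accuracy=0.70),
    Config("B", cost=2.0, accuracy=0.65),  # dominated by A
    Config("C", cost=3.0, accuracy=0.80),
    Config("D", cost=3.0, accuracy=0.75),  # dominated by C
]
frontier_names = [c.name for c in efficiency_frontier(candidates)]
```

Under this reading, the most accurate configuration always sits on the frontier even when it is the most expensive, and cheaper configurations appear on it only when nothing costing the same or less is at least as accurate.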
