The Impact of LLM Self-Consistency and Reasoning Effort on Automated Scoring Accuracy and Cost
Scott Frohn

TL;DR
This study investigates how self-consistency and reasoning effort affect the accuracy and cost of automated scoring with large language models, and identifies the configurations that best trade accuracy against cost.
Contribution
The paper shows that strategic model selection and reasoning settings outperform ensembling, pointing to cost-effective ways of achieving high-accuracy LLM scoring.
Findings
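Temperature sampling significantly improves accuracy over deterministic calls.
Increasing ensemble size beyond one (up to j = 7) does not significantly improve accuracy.
Higher reasoning effort shows a significant positive linear trend with scoring accuracy, though the benefit varies by model family.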
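Self-consistency, as used in this paper, means intra-model majority voting over j sampled scores. The following is a minimal sketch of that idea, not the author's implementation: the function names, the `sample_score` callable (standing in for one temperature-sampled model call), and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(scores):
    """Return the most frequent score.

    Ties go to the tied score that was sampled first (an assumed rule;
    the paper's tie-breaking is not stated in this summary).
    """
    if not scores:
        raise ValueError("need at least one sampled score")
    counts = Counter(scores)
    top = max(counts.values())
    for score in scores:
        if counts[score] == top:
            return score


def self_consistent_score(sample_score, j):
    """Call a hypothetical sampler j times and majority-vote the results.

    sample_score: zero-argument callable returning one score per call.
    j: ensemble size; j = 1 reduces to a single sampled call.
    """
    return majority_vote([sample_score() for _ in range(j)])
```

With j = 1 the vote is trivially the single sampled score, which is the baseline the larger ensembles are compared against.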
Abstract
Strategic model selection and reasoning settings are more effective than ensembling for optimizing automated scoring with large language models (LLMs). We examined self-consistency (intra-model majority voting) and reasoning effort for scoring conversation-based assessment items in high school mathematics, evaluating 900 student conversations against human-scored ground truths using frontier and low-cost models from OpenAI and Google. Temperature sampling significantly improved accuracy over deterministic calls, but increasing ensemble size (j = 1 to 7) produced no significant gains. Higher reasoning effort showed a significant positive linear trend with scoring accuracy, though the benefit varied by model family. An efficiency frontier analysis identified Gemini 3.1 Pro Preview at low reasoning as the most accurate but costly configuration; GPT-5.4 Nano and Mini with no reasoning…
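An efficiency frontier over scoring configurations can be read as the set of configurations that no other configuration beats on both cost and accuracy. The sketch below illustrates one common way to compute such a frontier; it assumes each configuration is summarized by a single cost and a single accuracy value, which may differ from the paper's actual analysis. The function name and tuple layout are hypothetical.

```python
def efficiency_frontier(configs):
    """Return the non-dominated configurations, sorted by increasing cost.

    configs: iterable of (name, cost, accuracy) tuples (assumed layout).
    A configuration is kept only if it is more accurate than every
    configuration that costs the same or less.
    """
    frontier = []
    best_accuracy = float("-inf")
    # Sort by cost; among equal costs, consider the most accurate first.
    for name, cost, accuracy in sorted(configs, key=lambda c: (c[1], -c[2])):
        if accuracy > best_accuracy:
            frontier.append((name, cost, accuracy))
            best_accuracy = accuracy
    return frontier
```

Given placeholder inputs such as `[("a", 1.0, 0.70), ("b", 2.0, 0.65), ("c", 3.0, 0.80)]` (illustrative values, not results from the paper), configuration `b` is dropped because `a` is both cheaper and more accurate.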
