Unveiling Scoring Processes: Dissecting the Differences between LLMs and Human Graders in Automatic Scoring
Xuansheng Wu, Padmaja Pravin Saraf, Gyeonggeon Lee, Ehsan Latif, Ninghao Liu, Xiaoming Zhai

TL;DR
This paper investigates how large language models score student responses compared to humans, revealing an alignment gap and showing that better-designed rubrics can improve scoring accuracy and consistency.
Contribution
It uncovers the grading rubrics used by LLMs, compares them with human rubrics, and demonstrates that aligning these rubrics enhances scoring accuracy.
Findings
LLMs exhibit a notable gap in scoring alignment with humans.
Incorporating high-quality rubrics improves LLM scoring accuracy.
LLMs tend to use shortcuts, bypassing deeper reasoning.
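The findings above speak of an alignment gap and of scoring accuracy without saying how either is measured. The sketch below shows one common way to quantify both. It is an illustration under assumptions, not the authors' evaluation code: the function names, the choice of Cohen's kappa for score agreement, and the Jaccard overlap between rubric items are all hypothetical stand-ins.

```python
from collections import Counter


def accuracy(human_scores, llm_scores):
    """Fraction of responses where the LLM score equals the human score."""
    if len(human_scores) != len(llm_scores) or not human_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    matches = sum(h == m for h, m in zip(human_scores, llm_scores))
    return matches / len(human_scores)


def cohen_kappa(human_scores, llm_scores):
    """Chance-corrected agreement between two graders (assumed metric)."""
    n = len(human_scores)
    observed = accuracy(human_scores, llm_scores)
    human_counts = Counter(human_scores)
    llm_counts = Counter(llm_scores)
    labels = set(human_counts) | set(llm_counts)
    # Agreement expected if both graders assigned labels independently.
    expected = sum(human_counts[c] * llm_counts[c] for c in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both graders used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def rubric_overlap(human_rubric, llm_rubric):
    """Jaccard overlap between two sets of rubric items (assumed proxy)."""
    human_items, llm_items = set(human_rubric), set(llm_rubric)
    if not human_items and not llm_items:
        return 1.0
    return len(human_items & llm_items) / len(human_items | llm_items)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    human = [1, 0, 1, 1]
    llm = [1, 0, 0, 1]
    print(accuracy(human, llm))
    print(cohen_kappa(human, llm))
    print(rubric_overlap({"mentions energy", "names particles"},
                         {"names particles", "uses keyword"}))
```

Treating rubric items as exact-match strings is a simplification; comparing free-text rubrics in practice would likely need human coding or a semantic similarity measure.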
Abstract
Large language models (LLMs) have demonstrated strong potential in performing automatic scoring for constructed response assessments. While constructed responses graded by humans are usually based on given grading rubrics, the methods by which LLMs assign scores remain largely unclear. It is also uncertain how closely AI's scoring process mirrors that of humans or if it adheres to the same grading criteria. To address this gap, this paper uncovers the grading rubrics that LLMs used to score students' written responses to science tasks and their alignment with human scores. We also examine whether enhancing the alignments can improve scoring accuracy. Specifically, we prompt LLMs to generate analytic rubrics that they use to assign scores and study the alignment gap with human grading rubrics. Based on a series of experiments with various configurations of LLM settings, we reveal a…
Taxonomy
Topics: Artificial Intelligence in Law
