Characteristics of hand and machine-assigned scores to college students' answers to open-ended tasks
Stephen P. Klein

TL;DR
This study finds that machine-assigned scores on college students' open-ended responses agree closely with scores from human readers, correlate with SAT scores and college grades about as strongly as hand-assigned scores do, and do not widen score differences between groups, which supports using machine scoring in large-scale assessments.
Contribution
The paper demonstrates that machine scoring is a reliable, valid, and unbiased method for grading open-ended responses in higher education assessments.
Findings
High inter-reader agreement in human scoring
Machine scores correlate strongly with human scores
Machine scoring does not increase score disparities across groups
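The first two findings are statements about agreement between sets of scores. As a minimal sketch of how such agreement could be quantified, the Python below computes a Pearson correlation between two human readers, and between the human mean and a machine score. The choice of Pearson's r, the function name `pearson`, and the toy score lists are all assumptions for illustration; they are not the paper's data or its confirmed method.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of the same length (at least 2 items)")
    mx, my = mean(x), mean(y)
    # Sum of cross-products and sums of squares around each mean.
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a list has no variance")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy scores for six answers (NOT data from the study).
reader_1 = [3, 4, 2, 5, 4, 1]
reader_2 = [3, 5, 2, 4, 4, 2]
machine = [3.2, 4.4, 2.1, 4.6, 3.9, 1.5]

# Average the two readers to get one hand-assigned score per answer.
human_mean = [(a + b) / 2 for a, b in zip(reader_1, reader_2)]

inter_reader_r = pearson(reader_1, reader_2)    # human vs. human
human_machine_r = pearson(human_mean, machine)  # human vs. machine

print(f"inter-reader r:  {inter_reader_r:.2f}")
print(f"human-machine r: {human_machine_r:.2f}")
```

Comparing the two coefficients gives a rough sense of whether the machine agrees with the readers about as well as the readers agree with each other.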
Abstract
Assessment of learning in higher education is a critical concern to policy makers, educators, parents, and students. Doing so appropriately is likely to require including constructed-response tests in the assessment system. We examined whether scoring costs and other concerns with using open-ended measures on a large scale (e.g., turnaround time and inter-reader consistency) could be addressed by machine grading the answers. Analyses with 1359 students from 14 colleges found that two human readers agreed highly with each other in the scores they assigned to the answers to three types of open-ended questions. These reader-assigned scores also agreed highly with those assigned by a computer. The correlations of the machine-assigned scores with SAT scores, college grades, and other measures were comparable to the correlations of these variables with the hand-assigned scores. Machine…
