Comparing Human and Automated Evaluation of Open-Ended Student Responses to Questions of Evolution
Michael J Wiser, Louise S Mead, James J Smith, Robert T Pennock

TL;DR
This study compares human and machine learning-based scoring of student responses to evolution questions. It finds substantial inter-rater reliability but also systematic differences, suggesting that ML scoring is better suited to formative assessment than to final grading.
Contribution
It evaluates how well EvoGrader scores student responses to questions distinct from those it was trained on, and identifies its potential and limitations relative to human scoring.
Findings
Substantial inter-rater reliability between human and ML scores
Systematic differences remain between human and ML scores
The authors therefore advise using ML scoring for formative, not summative, assessment
Abstract
Written responses can provide a wealth of data in understanding student reasoning on a topic. Yet they are time- and labor-intensive to score, requiring many instructors to forego them except as limited parts of summative assessments at the end of a unit or course. Recent developments in Machine Learning (ML) have produced computational methods of scoring written responses for the presence or absence of specific concepts. Here, we compare the scores from one particular ML program -- EvoGrader -- to human scoring of responses to structurally- and content-similar questions that are distinct from the ones the program was trained on. We find that there is substantial inter-rater reliability between the human and ML scoring. However, sufficient systematic differences remain between the human and ML scoring that we advise only using the ML scoring for formative, rather than summative,…
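The two findings above, substantial agreement alongside systematic differences, can coexist, and a small worked example may make that concrete. The sketch below is illustrative only: the abstract does not say which agreement statistic the authors used, so Cohen's kappa is an assumption here (it is a common choice for presence/absence scoring), and the score lists are invented, not data from the paper. The function names `cohen_kappa` and `mean_difference` are hypothetical helpers.

```python
from collections import Counter


def cohen_kappa(human, machine):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(human) != len(machine) or not human:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(human)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(h == m for h, m in zip(human, machine)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    human_counts = Counter(human)
    machine_counts = Counter(machine)
    labels = set(human) | set(machine)
    expected = sum(human_counts[l] * machine_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def mean_difference(human, machine):
    """Average of (machine - human); non-zero values indicate a directional bias."""
    return sum(m - h for h, m in zip(human, machine)) / len(human)


# Invented scores for one concept: 1 = concept present, 0 = concept absent.
human_scores = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
machine_scores = [1, 1, 0, 1, 1, 0, 1, 0, 1, 1]

kappa = cohen_kappa(human_scores, machine_scores)
bias = mean_difference(human_scores, machine_scores)
print(f"kappa = {kappa:.2f}, mean (machine - human) = {bias:.2f}")
```

In this invented data the two raters agree on most responses, yet every disagreement runs in the same direction (the machine credits a concept the human did not). An agreement statistic alone would not reveal that pattern, which is why a directional check is worth running alongside it.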
