Predicting Short Response Ratings with Non-Content Related Features: A Hierarchical Modeling Approach
Aubrey Condor

TL;DR
This study investigates how non-content features like response length and grammar influence human ratings of open-ended responses, revealing potential biases in scoring that could impact high-stakes assessments.
Contribution
It demonstrates that non-content features significantly predict response ratings, highlighting the need to scrutinize scoring practices for potential biases beyond content.
Findings
Non-content features predict ratings significantly
Response length and grammar are influential predictors
Potential bias in high-stakes scoring scenarios
Abstract
We explore whether the human ratings of open-ended responses can be explained with non-content related features, and if such effects vary across different mathematics-related items. When scoring is rigorously defined and rooted in a measurement framework, educators intend that the features of a response which are indicative of the respondent's level of ability are contributing to scores. However, we find that features such as response length, a grammar score of the response, and a metric relating to key phrase frequency are significant predictors for response ratings. Although our findings are not causally conclusive, they may propel us to be more critical of the way in which we assess open-ended responses, especially in high-stakes scenarios. Educators take great care to provide unbiased, consistent ratings, but it may be that extraneous features unrelated to those which were intended…
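To make the phrase "effects vary across different items" concrete, one common way to write such a hierarchical model is sketched below. This is an illustrative assumption, not the specification used in the paper: the abstract does not state the link function, the exact predictors, or the random-effects structure. Here i indexes responses, j indexes items, and the three x terms stand for response length, grammar score, and key phrase frequency. All symbols are introduced for illustration only.

```latex
% Hypothetical random-intercept, random-slope sketch (not the paper's stated model).
% y_{ij}: rating of response i to item j
% x_{kij}: k-th non-content feature (length, grammar score, key phrase frequency)
\begin{align}
  y_{ij} &= (\beta_0 + u_{0j}) + \sum_{k=1}^{3} (\beta_k + u_{kj})\, x_{kij} + \varepsilon_{ij}, \\
  \mathbf{u}_j &= (u_{0j}, u_{1j}, u_{2j}, u_{3j})^\top \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}), \\
  \varepsilon_{ij} &\sim \mathcal{N}(0, \sigma^2).
\end{align}
```

Under this sketch, the fixed effects \beta_k capture the average association between each feature and the rating, while the item-level deviations u_{kj} let that association differ from item to item. Because ratings are typically ordinal, an ordinal link could replace the linear form without changing the hierarchical structure.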
Taxonomy
Topics: Technology and Data Analysis
