Validity Arguments For Constructed Response Scoring Using Generative Artificial Intelligence Applications
Jodi M. Casabianca, Daniel F. McCaffrey, Matthew S. Johnson, Naim Alper, and Vladimir Zubenko

TL;DR
This paper examines the validity of using generative AI for scoring constructed responses, comparing it to traditional and feature-based methods, and proposes best practices for validity evidence collection.
Contribution
It highlights the differences in validity evidence requirements for generative AI scoring systems and offers guidelines for supporting score validity in high-stakes testing.
Findings
Generative AI scoring requires more extensive validity evidence than feature-based NLP scoring.
Constructed response scores from AI can be validated using multiple evidence sources.
Combining AI scores from different sources may improve construct coverage.
Abstract
The rapid advancements in large language models and generative artificial intelligence (AI) capabilities are making their broad application in the high-stakes testing context more likely. Use of generative AI in the scoring of constructed responses is particularly appealing because it reduces the effort required for handcrafting features in traditional AI scoring and might even outperform those methods. The purpose of this paper is to highlight the differences in the feature-based and generative AI applications in constructed response scoring systems and propose a set of best practices for the collection of validity evidence to support the use and interpretation of constructed response scores from scoring systems using generative AI. We compare the validity evidence needed in scoring systems using human ratings, feature-based natural language processing AI scoring engines, and…
Taxonomy
Topics: Technology and Data Analysis
Methods: Sparse Evolutionary Training
