S-GRADES -- Studying Generalization of Student Response Assessments in Diverse Evaluative Settings
Tasfia Seuti, Sagnik Ray Choudhury

TL;DR
This paper introduces S-GRADES, a comprehensive benchmark unifying diverse student response assessment datasets to evaluate and improve the generalization of automated grading models across different evaluative settings.
Contribution
The paper presents S-GRADES, a new open-source, web-based benchmark that consolidates 14 student response grading datasets under a unified interface with reproducible evaluation protocols, enabling standardized and extensible assessment.
Findings
Large language models show varying performance across datasets.
Exemplar selection impacts grading accuracy and transferability.
The benchmark reveals gaps in model reliability and generalization.
Abstract
Evaluating student responses, from long essays to short factual answers, is a key challenge in educational NLP. Automated Essay Scoring (AES) focuses on holistic writing qualities such as coherence and argumentation, while Automatic Short Answer Grading (ASAG) emphasizes factual correctness and conceptual understanding. Despite their shared goal, these paradigms have progressed in isolation with fragmented datasets, inconsistent metrics, and separate communities. We introduce S-GRADES (Studying Generalization of Student Response Assessments in Diverse Evaluative Settings), a web-based benchmark that consolidates 14 diverse grading datasets under a unified interface with standardized access and reproducible evaluation protocols. The benchmark is fully open-source and designed for extensibility, enabling continuous integration of new datasets and evaluation settings. To demonstrate the…
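As a rough illustration of what a unified record format with a shared evaluation protocol could look like, the sketch below scores essay and short-answer predictions with a single metric, computed per dataset. Everything here is an assumption made for illustration: the `GradedResponse` fields, the dataset names, and the choice of quadratic weighted kappa (a metric commonly reported in AES work) are not taken from the paper, and this is not the S-GRADES API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class GradedResponse:
    """Hypothetical unified record; field names are illustrative assumptions."""
    dataset: str    # dataset identifier (made-up names are used below)
    gold: int       # human-assigned score
    predicted: int  # model-assigned score
    min_score: int  # lowest score on this dataset's scale
    max_score: int  # highest score on this dataset's scale


def quadratic_weighted_kappa(gold, pred, min_score, max_score):
    """Agreement between two integer ratings, penalizing large disagreements more."""
    n = max_score - min_score + 1
    if n < 2 or not gold or len(gold) != len(pred):
        raise ValueError("need a scale with >= 2 levels and equal-length, non-empty ratings")
    total = len(gold)
    observed = Counter((g - min_score, p - min_score) for g, p in zip(gold, pred))
    gold_hist = Counter(g - min_score for g in gold)
    pred_hist = Counter(p - min_score for p in pred)
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n):
        for j in range(n):
            w = (i - j) ** 2 / (n - 1) ** 2
            num += w * observed[(i, j)] / total
            den += w * gold_hist[i] * pred_hist[j] / total ** 2
    # den is zero only when both raters gave one identical score throughout.
    return 1.0 if den == 0 else 1.0 - num / den


def evaluate_by_dataset(records):
    """Group records by dataset and apply the same metric to each group."""
    grouped = {}
    for r in records:
        grouped.setdefault(r.dataset, []).append(r)
    return {
        name: quadratic_weighted_kappa(
            [r.gold for r in rs], [r.predicted for r in rs],
            rs[0].min_score, rs[0].max_score,
        )
        for name, rs in grouped.items()
    }


# Made-up data: one essay-style scale (1-4) and one short-answer scale (0-2).
records = [
    GradedResponse("essay_demo", 1, 4, 1, 4),
    GradedResponse("essay_demo", 2, 3, 1, 4),
    GradedResponse("essay_demo", 3, 2, 1, 4),
    GradedResponse("essay_demo", 4, 1, 1, 4),
    GradedResponse("short_answer_demo", 0, 0, 0, 2),
    GradedResponse("short_answer_demo", 1, 1, 0, 2),
    GradedResponse("short_answer_demo", 2, 2, 0, 2),
]
scores = evaluate_by_dataset(records)
```

Keeping the score range on each record lets one metric function serve datasets with different scales, which is one plausible way to report comparable numbers for both AES and ASAG data.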
Taxonomy
Topics: Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning · Text Readability and Simplification
