Unveiling the Tapestry of Automated Essay Scoring: A Comprehensive Investigation of Accuracy, Fairness, and Generalizability

Kaixun Yang; Mladen Raković; Yuyang Li; Quanlong Guan; Dragan Gašević; Guanliang Chen

arXiv:2401.05655 · cs.CL · January 12, 2024 · 2 cites

TL;DR

This study evaluates nine AES methods on accuracy, fairness, and generalizability using a large dataset, revealing trade-offs between prompt-specific and cross-prompt models and highlighting the importance of model choice for equitable assessment.

Contribution

It provides a comprehensive comparison of AES models across multiple metrics, emphasizing the impact of model type and prompt specificity on bias and performance.

Findings

01. Prompt-specific models outperform cross-prompt models in accuracy.

02. Prompt-specific models show greater bias related to economic status.

03. Traditional models with engineered features can achieve high accuracy and fairness.
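
The findings above contrast accuracy with group-level bias. As a rough illustration of how those two quantities can be measured side by side, here is a minimal sketch in plain Python. It assumes quadratic weighted kappa as the accuracy measure (widely used for essay scoring) and the mean signed scoring error per student group as a simple bias indicator. This listing does not state which metrics the study used, so both choices, the function names, and the toy data are assumptions made for illustration only.

```python
def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Agreement between human and model scores on an integer scale.

    Assumed accuracy measure; requires max_score > min_score.
    """
    n = max_score - min_score + 1
    total = len(human)
    # Confusion matrix: rows are human scores, columns are model scores.
    observed = [[0.0] * n for _ in range(n)]
    for h, m in zip(human, model):
        observed[h - min_score][m - min_score] += 1
    hist_h = [sum(row) for row in observed]
    hist_m = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    num = 0.0
    den = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_h[i] * hist_m[j] / total
            num += weight * observed[i][j]
            den += weight * expected
    if den == 0:
        # Both raters used a single identical score; treat as full agreement.
        return 1.0
    return 1.0 - num / den


def mean_signed_error_by_group(human, model, groups):
    """Average (model - human) score difference for each group label.

    Assumed bias indicator: a group with a clearly negative value is
    being under-scored by the model relative to human raters.
    """
    sums = {}
    counts = {}
    for h, m, g in zip(human, model, groups):
        sums[g] = sums.get(g, 0.0) + (m - h)
        counts[g] = counts.get(g, 0) + 1
    return {g: sums[g] / counts[g] for g in sums}


# Hypothetical toy data: scores on a 1-4 scale, two made-up group labels.
human_scores = [3, 3, 3, 3]
model_scores = [3, 3, 2, 2]
group_labels = ["group_a", "group_a", "group_b", "group_b"]

kappa = quadratic_weighted_kappa(human_scores, model_scores, 1, 4)
errors = mean_signed_error_by_group(human_scores, model_scores, group_labels)
gap = abs(errors["group_a"] - errors["group_b"])
```

Reporting an agreement score together with a per-group error gap makes a trade-off of the kind described in the findings visible: a model can agree well with human raters overall while still scoring one group systematically lower.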

Abstract

Automatic Essay Scoring (AES) is a well-established educational pursuit that employs machine learning to evaluate student-authored essays. While much effort has been made in this area, current research primarily focuses on either (i) boosting the predictive accuracy of an AES model for a specific prompt (i.e., developing prompt-specific models), which often heavily relies on the use of the labeled data from the same target prompt; or (ii) assessing the applicability of AES models developed on non-target prompts to the intended target prompt (i.e., developing the AES models in a cross-prompt setting). Given the inherent bias in machine learning and its potential impact on marginalized groups, it is imperative to investigate whether such bias exists in current AES methods and, if identified, how it intervenes with an AES model's accuracy and generalizability. Thus, our study aimed to…

Code & Models

Repositories

carsonyang518/aaai24-aes-afg
PyTorch · Official

Videos

Unveiling the Tapestry of Automated Essay Scoring: A Comprehensive Investigation of Accuracy, Fairness, and Generalizability · Underline

Taxonomy

Topics: Online Learning and Analytics · Text Readability and Simplification · Topic Modeling