Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems
Anubha Kabra, Mehar Bhatia, Yaman Kumar, Junyi Jessy Li, Rajiv Ratn Shah

TL;DR
This paper introduces a comprehensive adversarial evaluation toolkit for automatic essay scoring systems, revealing their over-stability and highlighting the need for more holistic assessment methods.
Contribution
It proposes a model-agnostic adversarial testing scheme and metrics for AES systems, addressing the lack of holistic evaluation across multiple essay features.
Findings
AES models are highly over-stable: their scores change little even under substantial content modifications.
Irrelevant content can increase automated scores.
Human raters struggle to detect adversarial content.
Abstract
Automatic scoring engines have been used for scoring approximately fifteen million test-takers in just the last three years. This number is increasing further due to COVID-19 and the associated automation of education and testing. Despite such wide usage, the AI-based testing literature of these "intelligent" models is highly lacking. Most of the papers proposing new models rely only on quadratic weighted kappa (QWK) based agreement with human raters for showing model efficacy. However, this effectively ignores the highly multi-feature nature of essay scoring. Essay scoring depends on features like coherence, grammar, relevance, sufficiency, and vocabulary. To date, there has been no study testing Automated Essay Scoring (AES) systems holistically on all these features. With this motivation, we propose a model-agnostic adversarial evaluation scheme and associated metrics for AES systems…
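For readers unfamiliar with the QWK agreement measure mentioned in the abstract, the following is a minimal sketch of the standard quadratic weighted kappa computation in plain Python. It is not taken from the paper's toolkit; the function name `quadratic_weighted_kappa` and its parameters are assumptions chosen for illustration, and scores are assumed to be integers within a known range.

```python
def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Standard QWK between two lists of integer scores.

    Illustrative helper (hypothetical name and signature), not the
    paper's implementation.
    """
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    n = max_score - min_score + 1
    if n < 2:
        raise ValueError("need at least two distinct score levels")
    total = len(human)

    # Observed confusion matrix: rows are human scores, columns are model scores.
    observed = [[0] * n for _ in range(n)]
    for h, m in zip(human, model):
        observed[h - min_score][m - min_score] += 1

    # Marginal histograms for each rater.
    hist_h = [sum(row) for row in observed]
    hist_m = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_h[i] * hist_m[j] / total
            num += weight * observed[i][j]
            den += weight * expected

    if den == 0.0:
        # Both raters gave the same single score to every essay.
        return 1.0
    return 1.0 - num / den


if __name__ == "__main__":
    human_scores = [1, 2, 3, 4]
    model_scores = [1, 2, 3, 4]
    print(quadratic_weighted_kappa(human_scores, model_scores, 1, 4))
```

Because QWK is computed only over unmodified essays and their human scores, a high value says nothing about how the model's scores respond when an essay is perturbed, which is the gap the abstract points to.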
Taxonomy
Topics: Adversarial Robustness in Machine Learning
