Exploration of Summarization by Generative Language Models for Automated Scoring of Long Essays
Haowei Hua (1), Hong Jiao (2), Xinyi Wang (3) ((1) Princeton University, (2) University of Maryland, College Park, (3) University of Maryland, College Park & Beijing Normal University)

TL;DR
This paper investigates generative language models for automated scoring of long essays, using summarization and prompting to work around the 512-token input limit of encoder-based models. Reported QWK rises from 0.822 to 0.8878.
Contribution
It applies generative language models, combined with summarization and prompting, to long-essay scoring, avoiding the 512-token input limit of BERT-style encoder models.
Findings
QWK (quadratic weighted kappa) increased from 0.822 to 0.8878 on the Learning Agency Lab Automated Essay Scoring 2.0 dataset
Generative models outperform encoder-based models on long essays
Summarization and prompting enhance scoring accuracy
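For readers unfamiliar with the metric, the following is a minimal sketch of how quadratic weighted kappa (QWK) is commonly computed between human and model scores. It is not taken from the paper: the function name, the pure-Python implementation, and the 1 to 6 score range in the usage note are assumptions made for illustration.

```python
def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Agreement between two integer ratings; 1.0 means perfect agreement."""
    if max_score <= min_score:
        raise ValueError("need at least two score categories")
    if len(human) != len(model) or not human:
        raise ValueError("ratings must be non-empty and of equal length")

    n_cat = max_score - min_score + 1
    n = len(human)

    # Observed confusion matrix: rows are human scores, columns are model scores.
    observed = [[0] * n_cat for _ in range(n_cat)]
    for h, m in zip(human, model):
        observed[h - min_score][m - min_score] += 1

    # Marginal histograms, used to build the chance-agreement matrix.
    hist_h = [sum(row) for row in observed]
    hist_m = [sum(observed[i][j] for i in range(n_cat)) for j in range(n_cat)]

    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n_cat):
        for j in range(n_cat):
            weight = (i - j) ** 2 / (n_cat - 1) ** 2
            num += weight * observed[i][j]
            den += weight * hist_h[i] * hist_m[j] / n

    # If both raters give one identical constant score, treat it as full agreement.
    if den == 0.0:
        return 1.0
    return 1.0 - num / den
```

Under these assumptions, `quadratic_weighted_kappa(human_scores, model_scores, 1, 6)` would score a set of essays rated on a 1 to 6 scale. Because the penalty grows with the square of the score difference, a prediction that is off by two points costs four times as much as one that is off by one.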
Abstract
BERT and its variants have been extensively explored for automated scoring. However, the 512-token input limit of these encoder-based models is a weakness when scoring long essays. This research therefore explores generative language models for automated scoring of long essays via summarization and prompting. The results show a substantial improvement in scoring accuracy, with QWK increasing from 0.822 to 0.8878 on the Learning Agency Lab Automated Essay Scoring 2.0 dataset.
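The abstract does not spell out the pipeline, so the following is only a rough sketch of one way a summarize-then-prompt step could be wired together. Every name here (`build_scoring_prompt`, `truncate_summarizer`, `max_words`, the prompt wording, the 1 to 6 scale) is hypothetical and not from the paper. The summarizer is passed in as a callable, so a real generative model could be substituted for the trivial stand-in.

```python
from typing import Callable, Tuple


def truncate_summarizer(essay: str, keep_words: int = 50) -> str:
    """Stand-in for a model-based summarizer: keeps only the first words."""
    return " ".join(essay.split()[:keep_words])


def build_scoring_prompt(
    essay: str,
    summarize: Callable[[str], str],
    max_words: int = 400,
    score_range: Tuple[int, int] = (1, 6),
) -> str:
    """Build a scoring prompt, summarizing the essay only if it is too long.

    Word count is a crude proxy for token count; a real system would use
    the target model's tokenizer instead.
    """
    too_long = len(essay.split()) > max_words
    text = summarize(essay) if too_long else essay
    low, high = score_range
    return (
        f"Score the following essay on a scale of {low} to {high}. "
        "Reply with a single integer.\n\n"
        f"Essay:\n{text}\n\nScore:"
    )


# Usage: a short essay passes through unchanged, a long one is summarized first.
short_prompt = build_scoring_prompt("A brief essay.", truncate_summarizer)
long_prompt = build_scoring_prompt("word " * 1000, truncate_summarizer)
```

The resulting prompt string would then be sent to a generative model, and the returned integer compared against human scores.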
