Exploration of Summarization by Generative Language Models for Automated Scoring of Long Essays

Haowei Hua (1), Hong Jiao (2), Xinyi Wang (3) ((1) Princeton University, (2) University of Maryland, College Park, (3) University of Maryland, College Park & Beijing Normal University)

arXiv:2510.22830·cs.CL·November 20, 2025

TL;DR

This paper investigates generative language models for automated scoring of long essays, using summarization and prompting to work around the 512-token input limit of encoder-based models such as BERT. On the Learning Agency Lab Automated Essay Scoring 2.0 dataset, the reported QWK rises from 0.822 to 0.8878.

Contribution

The paper applies generative language models to long-essay scoring, combining summarization with prompting to overcome the 512-token input limit of encoder-based models.
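
As a rough illustration of how summarization can sidestep an input-length cap, the sketch below condenses an essay only when it exceeds a token budget and then passes the result to a scorer. This is not the authors' pipeline: `score_long_essay`, `summarize`, `score`, and `max_tokens` are hypothetical names, the two callables stand in for whatever generative model and scoring model one actually uses, and counting whitespace-separated words is only a crude proxy for a real subword tokenizer.

```python
from typing import Callable


def score_long_essay(
    essay: str,
    summarize: Callable[[str], str],  # hypothetical: wraps a generative model
    score: Callable[[str], int],      # hypothetical: wraps a scoring model
    max_tokens: int = 512,            # budget of an encoder such as BERT
) -> int:
    """Score an essay, summarizing it first if it exceeds the token budget."""
    # Assumption: whitespace splitting approximates the tokenizer's count.
    n_tokens = len(essay.split())
    text = essay if n_tokens <= max_tokens else summarize(essay)
    return score(text)
```

Passing the models in as callables keeps the length check separate from any particular model API, so the same function works with a stub in tests and a real model in use.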

Findings

01

QWK (quadratic weighted kappa) increased from 0.822 to 0.8878 on the Learning Agency Lab Automated Essay Scoring 2.0 dataset

02

Generative models outperform encoder-based models on long essays

03

Summarization and prompting enhance scoring accuracy

Abstract

BERT and its variants have been extensively explored for automated scoring. However, the 512-token input limit of these encoder-based models is a deficiency when scoring long essays automatically. This research therefore explores generative language models for automated scoring of long essays via summarization and prompting. The results show a substantial improvement in scoring accuracy, with QWK increasing from 0.822 to 0.8878 on the Learning Agency Lab Automated Essay Scoring 2.0 dataset.
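
For readers unfamiliar with the metric, QWK (quadratic weighted kappa) measures agreement between two sets of ordinal ratings, penalizing each disagreement by the squared distance between the two scores. The following is a minimal pure-Python sketch of the standard definition, not the evaluation code used in the paper; the function name and the `min_rating`/`max_rating` parameters are chosen here for illustration.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, min_rating, max_rating):
    """Quadratic weighted kappa for two equal-length lists of integer ratings."""
    if len(human) != len(model) or not human:
        raise ValueError("rating lists must be non-empty and of equal length")
    if max_rating <= min_rating:
        raise ValueError("need at least two rating categories")

    n = len(human)
    n_cat = max_rating - min_rating + 1
    observed = Counter(zip(human, model))  # joint counts of rating pairs
    hist_h = Counter(human)                # marginal counts, first rater
    hist_m = Counter(model)                # marginal counts, second rater

    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(min_rating, max_rating + 1):
        for j in range(min_rating, max_rating + 1):
            weight = (i - j) ** 2 / (n_cat - 1) ** 2
            num += weight * observed[(i, j)]
            den += weight * hist_h[i] * hist_m[j] / n

    if den == 0:
        # Both raters gave the same single rating to every essay.
        return 1.0
    return 1.0 - num / den
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.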
