Empirical Comparison of Encoder-Based Language Models and Feature-Based Supervised Machine Learning Approaches to Automated Scoring of Long Essays

Kuo Wang (1); Haowei Hua (2); Pengfei Yan (3); Hong Jiao (3); Dan Song (4) ((1) Southern Methodist University; (2) Princeton University; (3) University of Maryland; (4) University of Iowa)

arXiv:2601.02659 · cs.CL · January 8, 2026

Open Access

TL;DR

This study compares encoder-based language models with feature-based supervised machine learning methods for automated scoring of long essays, and finds that ensemble models combining embeddings with gradient boosting outperform individual models.

Contribution

It introduces an ensemble-of-embeddings approach combining multiple language models with gradient boosting, improving long essay scoring accuracy.

Findings

01  The ensemble-of-embeddings model outperforms individual language models.

02  Gradient boosting enhances scoring performance.

03  Long essay scoring benefits from combining multiple model representations.

Abstract

Long context may pose challenges for encoder-only language models in text processing, specifically for automated scoring of essays. This study trained several commonly used encoder-based language models for automated scoring of long essays. The performance of these trained models was evaluated and compared with that of ensemble models built upon the base language models with a token limit of 512. The models evaluated include BERT-based models (BERT, RoBERTa, DistilBERT, and DeBERTa), ensemble models integrating embeddings from multiple encoder models, and ensemble models of feature-based supervised machine learning models, including Gradient-Boosted Decision Trees, eXtreme Gradient Boosting, and Light Gradient Boosting Machine. We trained, validated, and tested each model on a dataset of 17,307 essays, with an 80%/10%/10% split, and evaluated model performance using Quadratic Weighted…
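To make the 80%/10%/10% split concrete, here is a minimal sketch of one way to partition 17,307 essay indices into training, validation, and test sets. The function name split_indices, the fixed seed, the use of a simple random shuffle, and the choice to round the training and validation sizes down (with the remainder going to the test set) are all assumptions for illustration; the abstract does not say how the authors performed the split or whether it was stratified by score.

```python
import random


def split_indices(n_items, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle indices 0..n_items-1 and cut them into train/val/test lists.

    Hypothetical helper: the rounding rule and the seed are assumptions,
    not details taken from the paper.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)      # reproducible shuffle

    n_train = int(n_items * train_frac)       # rounded down
    n_val = int(n_items * val_frac)           # rounded down

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]          # remainder goes to the test set
    return train, val, test


train_idx, val_idx, test_idx = split_indices(17307)
print(len(train_idx), len(val_idx), len(test_idx))  # 13845 1730 1732
```

Under these rounding assumptions the three partitions contain 13,845, 1,730, and 1,732 essays. A different rounding rule or a stratified split would shift these counts by an essay or two.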

Taxonomy

Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Artificial Intelligence in Healthcare and Education