Short-answer scoring with ensembles of pretrained language models
Christopher Ormerod

TL;DR
This paper explores the use of ensembles of pretrained transformer models for short-answer scoring, achieving state-of-the-art results with large ensembles but noting practical limitations.
Contribution
It demonstrates that ensembles of large pretrained transformers can outperform individual models in short-answer scoring tasks.
Findings
Larger models perform slightly better individually but still fall short of state-of-the-art results on their own.
Ensembles of large models achieve state-of-the-art results.
Large ensembles are impractical for production use.
Abstract
We investigate the effectiveness of ensembles of pretrained transformer-based language models on short-answer questions using the Kaggle Automated Short Answer Scoring dataset. We fine-tune a collection of popular small, base, and large pretrained transformer-based language models, and train one feature-based model on the dataset, with the aim of testing ensembles of these models. We used an early stopping mechanism and hyperparameter optimization in training. We observe that the larger models generally perform slightly better; however, they still fall short of state-of-the-art results on their own. Once we consider ensembles of models, some ensembles of several large networks do produce state-of-the-art results; however, these ensembles are too large to realistically be put in a production environment.
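To make the idea of an ensemble concrete, here is a minimal sketch of one common way to combine several fine-tuned scorers: average their per-class probabilities for each response and take the highest-scoring class. The abstract does not say how the ensembles are actually combined, so the averaging rule, the function name ensemble_predict, and the toy probabilities used with it are all assumptions made for illustration.

```python
from typing import List, Sequence


def ensemble_predict(
    model_probs: Sequence[Sequence[Sequence[float]]],
) -> List[int]:
    """Combine models by averaging class probabilities (an assumed rule).

    model_probs[m][i][c] is the probability that model m assigns
    score class c to response i. Returns one predicted class per response.
    """
    n_models = len(model_probs)
    n_items = len(model_probs[0])
    predictions = []
    for i in range(n_items):
        n_classes = len(model_probs[0][i])
        # Mean probability of each score class across all models.
        mean = [
            sum(m[i][c] for m in model_probs) / n_models
            for c in range(n_classes)
        ]
        # Predicted score is the class with the highest mean probability.
        predictions.append(max(range(n_classes), key=mean.__getitem__))
    return predictions


# Hypothetical outputs of three models on two responses, scores 0 to 2.
model_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
model_b = [[0.2, 0.7, 0.1], [0.1, 0.3, 0.6]]
model_c = [[0.5, 0.4, 0.1], [0.1, 0.2, 0.7]]

scores = ensemble_predict([model_a, model_b, model_c])
```

Every member of the ensemble must run on every response at inference time, so the cost grows with the number and size of the models, which is the practical limitation the abstract points to for ensembles of large networks.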
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Early Stopping
