Language Models are Few-Shot Graders
Chenyan Zhao, Mariana Silva, Seth Poulsen

TL;DR
This paper introduces an LLM-based automatic short answer grading (ASAG) pipeline that outperforms existing custom-built models on the same datasets, compares three OpenAI models (GPT-4, GPT-4o, and o1-preview) as graders, and explores prompt strategies such as RAG-based selection and rubrics to improve accuracy.
Contribution
The paper presents a novel LLM-based ASAG pipeline that surpasses previous custom-built models, and it systematically evaluates prompt strategies and model choices for automated grading.
Findings
GPT-4o balances accuracy and cost-effectiveness.
RAG-based selection improves grading accuracy.
Rubrics enhance evaluation consistency.
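The retrieval and rubric findings above can be illustrated with a minimal sketch of how a few-shot grading prompt might be assembled. This is not the authors' implementation: reading "RAG-based selection" as retrieval of similar graded responses is an interpretation, the bag-of-words cosine similarity is a simple stand-in for whatever retriever the paper uses, and the function names (`select_examples`, `build_prompt`) and prompt wording are hypothetical. The sketch stops at building the prompt text and makes no model call.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase word counts; a crude stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(response, graded_pool, k=2):
    """Return the k already-graded examples most similar to the new response."""
    query = bag_of_words(response)
    ranked = sorted(
        graded_pool,
        key=lambda ex: cosine(query, bag_of_words(ex["response"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, rubric, examples, response):
    """Assemble a few-shot grading prompt (hypothetical wording)."""
    parts = [f"Question: {question}", f"Rubric: {rubric}", "Graded examples:"]
    for ex in examples:
        parts.append(f"- Response: {ex['response']}\n  Grade: {ex['grade']}")
    parts.append(f"Student response to grade: {response}")
    parts.append("Grade:")
    return "\n".join(parts)
```

In use, the retrieved examples are passed into `build_prompt` together with the question and rubric, and the resulting text would be sent to whichever model is being evaluated. Swapping `bag_of_words` for a learned embedding changes only the similarity step.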
Abstract
Providing evaluations of student work is a critical component of effective student learning, and automating this process can significantly reduce the workload on human graders. Automatic Short Answer Grading (ASAG) systems, enabled by advancements in Large Language Models (LLMs), offer a promising solution for assessing open-ended student responses and providing instant feedback on them. In this paper, we present an ASAG pipeline leveraging state-of-the-art LLMs. Our new LLM-based ASAG pipeline achieves better performance than existing custom-built models on the same datasets. We also compare the grading performance of three OpenAI models: GPT-4, GPT-4o, and o1-preview. Our results demonstrate that GPT-4o achieves the best balance between accuracy and cost-effectiveness. On the other hand, o1-preview, despite higher accuracy, exhibits a larger variance in error that makes it less practical…
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Attention Is All You Need · Absolute Position Encodings · Linear Layer · Byte Pair Encoding · Layer Normalization · Residual Connection · Dense Connections · Label Smoothing · Multi-Head Attention · Position-Wise Feed-Forward Layer
