Evaluating Students' Open-ended Written Responses with LLMs: Using the RAG Framework for GPT-3.5, GPT-4, Claude-3, and Mistral-Large

Jussi S. Jauhiainen; Agustín Garagorry Guerra

arXiv:2405.05444·cs.CL·May 10, 2024·6 cites

TL;DR

This study assesses the effectiveness of various large language models in grading students' open-ended responses using the RAG framework, highlighting variability in consistency and grading outcomes across models and settings.

Contribution

The paper applies the RAG framework with multiple LLMs to educational assessment and evaluates the consistency and reliability of their grading.

Findings

01 Grading consistency varies significantly among models.

02 Temperature settings influence evaluation outcomes.

03 Further research is needed on accuracy and cost-effectiveness.

Abstract

Evaluating open-ended written examination responses from students is an essential yet time-intensive task for educators, requiring a high degree of effort, consistency, and precision. Recent developments in Large Language Models (LLMs) present a promising opportunity to balance the need for thorough evaluation with efficient use of educators' time. In our study, we explore the effectiveness of LLMs ChatGPT-3.5, ChatGPT-4, Claude-3, and Mistral-Large in assessing university students' open-ended answers to questions made about reference material they have studied. Each model was instructed to evaluate 54 answers repeatedly under two conditions: 10 times (10-shot) with a temperature setting of 0.0 and 10 times with a temperature of 0.5, expecting a total of 1,080 evaluations per model and 4,320 evaluations across all models. The RAG (Retrieval Augmented Generation) framework was used as…
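
The evaluation counts stated in the abstract can be checked with a short sketch. The code below only enumerates the planned design (4 models, 2 temperature settings, 54 answers, 10 repeats per condition); the constant and function names are hypothetical and chosen for illustration, and no model is actually called.

```python
from itertools import product

# Design parameters taken from the abstract; the names are illustrative assumptions.
MODELS = ["ChatGPT-3.5", "ChatGPT-4", "Claude-3", "Mistral-Large"]
TEMPERATURES = [0.0, 0.5]
N_ANSWERS = 54    # student answers evaluated by each model
N_REPEATS = 10    # repeated evaluations per answer and temperature


def build_evaluation_grid():
    """Return one (model, temperature, answer_id, repeat) tuple per planned evaluation."""
    return list(product(MODELS, TEMPERATURES, range(N_ANSWERS), range(N_REPEATS)))


grid = build_evaluation_grid()

# Count the planned evaluations for a single model, then across all models.
per_model = sum(1 for model, *_ in grid if model == MODELS[0])
total = len(grid)

print(per_model, total)  # 1080 4320
```

The totals match the abstract: 54 × 10 × 2 = 1,080 evaluations per model and 1,080 × 4 = 4,320 overall. The abstract describes these as expected totals, so the number of evaluations actually obtained may differ.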

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Topic Modeling · SAS software applications and methods

Methods: Attention Is All You Need · Linear Layer · Weight Decay · Multi-Head Attention · Attention Dropout · Dropout · Residual Connection · Softmax · WordPiece