Achieving Human Level Partial Credit Grading of Written Responses to Physics Conceptual Question using GPT-3.5 with Only Prompt Engineering
Zhongzhou Chen, Tong Wan

TL;DR
This study demonstrates that GPT-3.5, guided by a novel prompt engineering method called scaffolded chain of thought, can grade physics responses with accuracy comparable to human raters, using only prompt design.
Contribution
Introduces scaffolded chain of thought prompting for GPT-3.5, significantly improving auto-grading accuracy of physics responses without additional training.
Findings
GPT-3.5 with scaffolded COT can reach 70-80% agreement with human raters.
Grading accuracy improves by 20-30% over conventional COT prompts.
AI grading performance is comparable to human inter-rater reliability.
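The agreement figures above can be read as a simple percentage, sketched below. This assumes "agreement" means the share of responses on which two graders assign exactly the same grade; the summary here does not define the metric, so treat that as an assumption. The function name `percent_agreement` and the grade lists are made up for illustration and are not data from the paper.

```python
def percent_agreement(grades_a, grades_b):
    """Percentage of responses on which two graders gave the identical grade.

    Assumption: exact-match agreement. The paper may use a different measure.
    """
    if len(grades_a) != len(grades_b):
        raise ValueError("both graders must grade the same set of responses")
    if not grades_a:
        raise ValueError("need at least one graded response")
    matches = sum(a == b for a, b in zip(grades_a, grades_b))
    return 100.0 * matches / len(grades_a)


# Toy grades on a 0-2 partial-credit scale (illustrative only, not from the paper).
human_grades = [2, 1, 0, 2, 1, 2, 0, 1]
ai_grades = [2, 1, 0, 1, 1, 2, 0, 2]

print(percent_agreement(human_grades, ai_grades))  # 75.0
```

The same function applied to two human raters would give the human inter-rater baseline that the AI grader is compared against.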
Abstract
Large language models (LLMs) have great potential for auto-grading student written responses to physics problems due to their capacity to process and generate natural language. In this explorative study, we use a prompt engineering technique, which we name "scaffolded chain of thought (COT)", to instruct GPT-3.5 to grade student written responses to a physics conceptual question. Compared to common COT prompting, scaffolded COT prompts GPT-3.5 to explicitly compare student responses to a detailed, well-explained rubric before generating the grading outcome. We show that when compared to human raters, the grading accuracy of GPT-3.5 using scaffolded COT is 20%-30% higher than conventional COT. The level of agreement between AI and human raters can reach 70%-80%, comparable to the level between two human raters. This shows promise that an LLM-based AI grader can achieve human-level…
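As a rough illustration of the idea described in the abstract, the sketch below assembles a prompt that makes the model compare the response to each rubric item before it states a grade. It is not the authors' prompt: the `RubricItem` class, the `build_scaffolded_prompt` function, the wording of the instructions, and the example question and rubric are all hypothetical. The code only builds a string and calls no model API.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    label: str        # short name of the rubric item
    explanation: str  # what a response must say to satisfy the item


def build_scaffolded_prompt(question, rubric, response):
    """Build a grading prompt that asks for item-by-item comparison first.

    Hypothetical wording: the comparison step is requested before the
    grade so the model reasons against the rubric explicitly.
    """
    lines = [
        "You are grading a student's written response to a physics question.",
        f"Question: {question}",
        "Rubric:",
    ]
    for number, item in enumerate(rubric, start=1):
        lines.append(f"{number}. {item.label}: {item.explanation}")
    lines += [
        f"Student response: {response}",
        "Step 1: For each rubric item, compare the response to the item "
        "and state whether the item is satisfied, with a brief reason.",
        "Step 2: Only after completing Step 1, report the final grade as "
        f"the number of satisfied items, from 0 to {len(rubric)}.",
    ]
    return "\n".join(lines)


# Hypothetical question and rubric, invented for this example.
rubric = [
    RubricItem("Energy conservation",
               "States that mechanical energy is conserved without friction."),
    RubricItem("Mass independence",
               "Explains that the final speed does not depend on mass."),
]
prompt = build_scaffolded_prompt(
    "Two blocks of different mass slide down the same frictionless ramp. "
    "Which is faster at the bottom?",
    rubric,
    "They reach the same speed because energy is conserved.",
)
print(prompt)
```

Under this sketch, a conventional COT prompt would ask the model to reason and then grade, without the explicit per-item comparison step.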
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Online Learning and Analytics
