ChatGPT for automated grading of short answer questions in mechanical ventilation
Tejas Jade, Alex Yartsev

TL;DR
This study evaluates ChatGPT 4o's ability to automatically grade short answer questions in postgraduate medical education, revealing significant discrepancies from human grading and cautioning against its use in high-stakes assessments.
Contribution
First systematic evaluation of ChatGPT 4o for grading SAQs in postgraduate medical education, highlighting its limitations and variability compared to human graders.
Findings
ChatGPT awarded lower scores than humans with a mean difference of -1.34 on a 10-point scale.
Agreement between ChatGPT and human grading was poor, with an ICC of 0.086 and a Cohen's kappa of -0.0786.
Most disagreement occurred in evaluative and analytic items, less in checklist and prescriptive items.
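To make the agreement figure above concrete, here is a minimal sketch of how an unweighted Cohen's kappa can be computed for two graders. The function name `cohens_kappa` and the marks are hypothetical, invented for illustration; they are not the study's data or code, and the summary does not say whether the authors used a weighted or unweighted kappa. A value near zero (or below it, as reported here) means the graders agree no more often than chance would predict.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the identical score.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over categories of the product of marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = freq_a.keys() | freq_b.keys()
    p_chance = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical marks on a 10-point scale (illustration only, not study data).
human_marks = [7, 8, 6, 9, 7, 8]
model_marks = [6, 8, 5, 7, 7, 6]
kappa = cohens_kappa(human_marks, model_marks)
print(f"kappa = {kappa:.3f}")
```

Exact-match kappa treats a one-mark disagreement the same as a five-mark disagreement, which is one reason it is usually read alongside measures such as the ICC and the mean difference.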
Abstract
Standardised tests using short answer questions (SAQs) are common in postgraduate education. Large language models (LLMs) simulate conversational language and can interpret unstructured free-text responses in ways that align with how SAQ grading rubrics are applied, which makes them attractive for automated grading. We evaluated ChatGPT 4o as a grader of SAQs in a postgraduate medical setting, using data from 215 students (557 short-answer responses) enrolled in an online course on mechanical ventilation (2020-2024). Deidentified responses to three case-based scenarios were presented to ChatGPT with a standardised grading prompt and rubric. Outputs were analysed using mixed-effects modelling, variance component analysis, intraclass correlation coefficients (ICCs), Cohen's kappa, Kendall's W, and Bland-Altman statistics. ChatGPT awarded systematically lower marks than human graders with a mean difference…
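As an illustration of the Bland-Altman statistics named in the abstract, the sketch below computes the bias (mean of model minus human marks) and the conventional 95% limits of agreement. The function name `bland_altman` and the scores are hypothetical and not taken from the study. This simple form also assumes independent paired observations, whereas the study's data contain several responses per student, so the authors' actual analysis may well differ.

```python
from statistics import mean, stdev


def bland_altman(model_scores, human_scores):
    """Return (bias, lower limit, upper limit) for paired scores.

    bias is the mean of (model - human); the limits are bias +/- 1.96
    sample standard deviations of the paired differences.
    """
    if len(model_scores) != len(human_scores) or len(model_scores) < 2:
        raise ValueError("need at least two paired scores")
    diffs = [m - h for m, h in zip(model_scores, human_scores)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias, bias - spread, bias + spread


# Hypothetical marks on a 10-point scale (illustration only, not study data).
model_scores = [6, 8, 5, 7, 7, 6]
human_scores = [7, 8, 6, 9, 7, 8]
bias, lower, upper = bland_altman(model_scores, human_scores)
print(f"bias = {bias:.2f}, limits of agreement = [{lower:.2f}, {upper:.2f}]")
```

A negative bias corresponds to the model marking lower than the human graders on average, which is the direction of the difference the abstract reports.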
