Beyond Grading Accuracy: Exploring Alignment of TAs and LLMs
Matthijs Jansen op de Haar, Nacir Bouali, Faizan Ahmed

TL;DR
This study evaluates open-source LLMs for grading UML class diagrams by comparing their grades with those of teaching assistants (TAs) at the criterion level. The LLMs reach a per-criterion accuracy of up to 88.56% and a Pearson correlation of up to 0.78 with human grades, suggesting they can support automated grading.
Contribution
The paper introduces a criterion-level grading pipeline for UML class diagrams built on open-source LLMs, addressing the transparency and cost concerns that matter to universities in automated assessment.
Findings
Per-criterion accuracy up to 88.56%
Pearson correlation up to 0.78 with human grades
An optimal model that combines the best-performing LLMs approaches TA performance
Abstract
In this paper, we investigate the potential of open-source Large Language Models (LLMs) for grading Unified Modeling Language (UML) class diagrams. In contrast to existing work, which primarily evaluates proprietary LLMs, we focus on non-proprietary models, making our approach suitable for universities where transparency and cost are critical. Additionally, existing studies assess performance over complete diagrams rather than individual criteria, offering limited insight into how automated grading aligns with human evaluation. To address these gaps, we propose a grading pipeline in which student-generated UML class diagrams are independently evaluated by both teaching assistants (TAs) and LLMs. Grades are then compared at the level of individual criteria. We evaluate this pipeline through a quantitative study of 92 UML class diagrams from a software design course, comparing TA grades…
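The criterion-level comparison described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the diagram IDs, criterion names, marks, and the assumption that each criterion is marked as met (1) or not met (0) are all hypothetical, as is the choice to correlate total grades per diagram. It only shows how per-criterion accuracy and a Pearson correlation between TA and LLM grades might be computed.

```python
import math

# Hypothetical marks: diagram -> criterion -> 1 (met) or 0 (not met).
# Names and values are invented for illustration, not taken from the paper.
ta_marks = {
    "d1": {"classes": 1, "attributes": 1, "associations": 0},
    "d2": {"classes": 1, "attributes": 0, "associations": 1},
    "d3": {"classes": 0, "attributes": 1, "associations": 1},
    "d4": {"classes": 1, "attributes": 1, "associations": 1},
}
llm_marks = {
    "d1": {"classes": 1, "attributes": 1, "associations": 1},
    "d2": {"classes": 1, "attributes": 0, "associations": 1},
    "d3": {"classes": 0, "attributes": 0, "associations": 1},
    "d4": {"classes": 1, "attributes": 1, "associations": 1},
}


def per_criterion_accuracy(ta, llm):
    """Fraction of diagrams on which TA and LLM agree, for each criterion."""
    criteria = sorted(next(iter(ta.values())))
    return {
        c: sum(ta[d][c] == llm[d][c] for d in ta) / len(ta)
        for c in criteria
    }


def total_grades(marks):
    """Total grade per diagram (sum of criterion marks), in sorted ID order."""
    return [sum(marks[d].values()) for d in sorted(marks)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant grades")
    return cov / (norm_x * norm_y)


accuracy = per_criterion_accuracy(ta_marks, llm_marks)
correlation = pearson(total_grades(ta_marks), total_grades(llm_marks))

for criterion, value in accuracy.items():
    print(f"{criterion}: {value:.2%}")
print(f"Pearson r on total grades: {correlation:.2f}")
```

Reporting accuracy per criterion rather than per diagram shows which rubric items an LLM grades reliably and which it does not, which is the kind of insight a whole-diagram score hides.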
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Teaching and Learning Programming · Innovative Teaching and Learning Methods
