Evaluating Generative AI for CS1 Code Grading: Direct vs Reverse Methods
Ahmad Memon, Abdallah Mohamed

TL;DR
This paper compares two AI-based methods for grading introductory programming assignments, evaluating their accuracy, efficiency, and consistency against human graders, and explores their potential for scalable, fair assessment.
Contribution
It introduces a novel 'Reverse' grading approach using LLMs, compares it with the traditional 'Direct' method, and assesses their effectiveness in automated CS1 code grading.
Findings
The Reverse method offers finer-grained assessment than the Direct method.
The Direct method is faster and simpler to implement.
Both methods benefit from careful prompt engineering.
Abstract
Manual grading of programming assignments in introductory computer science courses can be time-consuming and prone to inconsistencies. While unit testing is commonly used for automatic evaluation, it typically follows a binary pass/fail model and does not give partial marks. Recent advances in large language models (LLMs) offer the potential for automated, scalable, and more objective grading. This paper compares two AI-based grading techniques: 'Direct', where the AI model applies a rubric directly to student code, and 'Reverse' (a newly proposed approach), where the AI first fixes errors, then deduces a grade based on the nature and number of fixes. Each method was evaluated on both the instructor's original grading scale and a tenfold expanded scale to assess the impact of range on AI grading accuracy. To assess their effectiveness, AI-assigned scores were evaluated…
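The scoring step of the Reverse approach, as the abstract describes it, can be illustrated with a minimal sketch. It assumes the model has already returned a list of fixes, each tagged with a category. The category names, the deduction weights, and the names `Fix` and `reverse_grade` are hypothetical choices made for illustration and are not taken from the paper; the `scale` parameter mimics the tenfold expanded scale.

```python
from dataclasses import dataclass

# Hypothetical deduction per fix category, in points on the original scale.
# These weights are illustrative assumptions, not values from the paper.
DEDUCTION = {"syntax": 0.5, "logic": 1.0, "missing_feature": 2.0}


@dataclass
class Fix:
    category: str      # must be a key of DEDUCTION
    description: str   # free-text note on what was changed


def reverse_grade(fixes, max_score, scale=1):
    """Derive a grade from the nature and number of fixes.

    The full mark and every deduction are multiplied by `scale`
    (1 for the original scale, 10 for a tenfold expanded scale),
    and the result is clamped at zero.
    """
    full_mark = max_score * scale
    penalty = sum(DEDUCTION[f.category] * scale for f in fixes)
    return max(0.0, full_mark - penalty)


fixes = [
    Fix("syntax", "added missing colon"),
    Fix("logic", "corrected off-by-one loop bound"),
]
print(reverse_grade(fixes, max_score=10))            # 8.5
print(reverse_grade(fixes, max_score=10, scale=10))  # 85.0
```

In this sketch the expanded scale only rescales the arithmetic. With an actual model, the scale would be stated in the prompt, and whether the wider range changes grading accuracy is the question the paper evaluates.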
Taxonomy
Topics: Teaching and Learning Programming · Intelligent Tutoring Systems and Adaptive Learning · Online Learning and Analytics
