Learning to Love Edge Cases in Formative Math Assessment: Using the AMMORE Dataset and Chain-of-Thought Prompting to Improve Grading Accuracy

Owen Henkel; Hannah Horne-Robinson; Maria Dyshel; Nabil Ch; Baptiste Moreau-Pernet; Ralph Abood

arXiv:2409.17904 · cs.AI · September 27, 2024 · 2 cites

Open Access

TL;DR

This paper presents the AMMORE dataset and demonstrates that chain-of-thought prompting with large language models significantly improves grading accuracy for challenging math answers, enhancing formative assessment in education.

Contribution

The study introduces the AMMORE dataset and evaluates LLM-based grading approaches, showing that chain-of-thought prompting greatly enhances accuracy and validity in math assessment.

Findings

01. Chain-of-thought prompting scores 92% of edge cases accurately.

02. Overall grading accuracy improves from 98.7% to 99.9%.

03. Student mastery misclassification drops from 6.9% to 2.6%.
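
The first two findings fit together under one simple assumption that this summary does not state outright: the edge cases are exactly the answers the baseline grader gets wrong (1 - 0.987 = 1.3% of all answers), and the chain-of-thought grader recovers 92% of them. A minimal Python sketch of that arithmetic follows; the function and variable names are hypothetical and not taken from the paper.

```python
def combined_accuracy(baseline_accuracy: float, edge_case_recovery: float) -> float:
    """Overall accuracy when a second grader re-scores the baseline's failures.

    Assumption (not from the paper): every baseline failure is an edge case,
    and the second grader only touches those failures.
    """
    edge_case_share = 1.0 - baseline_accuracy
    return baseline_accuracy + edge_case_share * edge_case_recovery


# Figures quoted in the findings: 98.7% baseline accuracy, 92% of edge cases recovered.
overall = combined_accuracy(baseline_accuracy=0.987, edge_case_recovery=0.92)
print(f"{overall:.1%}")  # 99.9%
```

Under this reading, 0.987 + 0.013 × 0.92 = 0.99896, which rounds to the reported 99.9%.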

Abstract

This paper introduces AMMORE, a new dataset of 53,000 math open-response question-answer pairs from Rori, a learning platform used by students in several African countries, and conducts two experiments to evaluate the use of large language models (LLMs) for grading particularly challenging student answers. The AMMORE dataset enables various potential analyses and provides an important resource for researching student math acquisition in understudied, real-world educational contexts. In experiment 1, we use a variety of LLM-driven approaches, including zero-shot, few-shot, and chain-of-thought prompting, to grade the 1% of student answers that a rule-based classifier fails to grade accurately. We find that the best-performing approach, chain-of-thought prompting, accurately scored 92% of these edge cases, effectively boosting the overall accuracy of the grading from 98.7% to 99.9%. In…
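
The two-stage setup in the abstract, a rule-based classifier with an LLM handling the answers it cannot grade, can be sketched roughly as below. This is an illustrative sketch only. The matching rule, the prompt wording, the `VERDICT:` convention, and every function name are assumptions, not the paper's implementation, and the LLM is passed in as a plain callable so no particular API is implied.

```python
from typing import Callable, Optional


def rule_based_grade(expected: str, answer: str) -> Optional[bool]:
    """Hypothetical rule: exact match after trimming and lowercasing.

    Returns True on a match and None when the rule cannot decide.
    """
    if answer.strip().lower() == expected.strip().lower():
        return True
    return None


def build_cot_prompt(question: str, expected: str, answer: str) -> str:
    """Assumed chain-of-thought prompt: ask for reasoning, then a fixed verdict line."""
    return (
        "You are grading a student's answer to a math question.\n"
        f"Question: {question}\n"
        f"Expected answer: {expected}\n"
        f"Student answer: {answer}\n"
        "Reason step by step about whether the student answer is "
        "mathematically equivalent to the expected answer. "
        "Then finish with a final line that reads exactly "
        "'VERDICT: correct' or 'VERDICT: wrong'."
    )


def parse_verdict(completion: str) -> bool:
    """Read the verdict from the last non-empty line; anything else counts as wrong."""
    lines = completion.strip().splitlines()
    if not lines:
        return False
    return lines[-1].strip().lower() == "verdict: correct"


def grade(question: str, expected: str, answer: str, llm: Callable[[str], str]) -> bool:
    """Try the cheap rule first; fall back to the LLM only for undecided answers."""
    decided = rule_based_grade(expected, answer)
    if decided is not None:
        return decided
    return parse_verdict(llm(build_cot_prompt(question, expected, answer)))


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call, used only to make the sketch runnable.
    return "The student wrote 'one half', which equals 1/2.\nVERDICT: correct"


print(grade("What is 2/4 simplified?", "1/2", "one half", fake_llm))  # True
```

Keeping the rule first means the model is only called for the small share of answers the rule cannot settle, which matches the division of labor the abstract describes.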

Taxonomy

Topics: Educational Assessment and Pedagogy · Higher Education Learning Practices