TL;DR
This paper introduces MathEDU, a dataset for evaluating AI models' ability to provide reliable, targeted feedback on student mathematical problem-solving processes, highlighting current limitations and future needs.
Contribution
The study presents MathEDU, a new dataset and systematic evaluation of models for correctness classification, error detection, and feedback generation in math education.
Findings
Fine-tuning improves correctness classification and error detection.
Generated feedback often lacks specificity and is overly verbose.
Current models do not match teacher-written feedback quality.
Abstract
The increasing reliance on Large Language Models (LLMs) across various domains extends to education, where students increasingly use generative AI as a tool for learning. While prior work has examined LLMs' mathematical ability, their reliability in grading authentic student problem-solving processes and delivering effective feedback remains underexplored. This study introduces MathEDU, a dataset consisting of student problem-solving processes in mathematics and corresponding teacher-written feedback. We systematically evaluate the reliability of various models across three hierarchical tasks: answer correctness classification, error identification, and feedback generation. Experimental results show that fine-tuning strategies effectively improve performance in classifying correctness and locating erroneous steps. However, the generated feedback across models shows a considerable gap…