Evaluating AI Grading on Real-World Handwritten College Mathematics: A Large-Scale Study Toward a Benchmark

Zhiqi Yu; Xingping Liu; Haobin Mao; Mingshuo Liu; Long Chen; Jack Xin; Yifeng Yu

arXiv:2603.00895·cs.LG·March 3, 2026

TL;DR

This large-scale study evaluates AI's ability to grade handwritten college calculus work, demonstrating strong alignment with human graders and proposing a benchmark for future research in AI-based mathematical assessment.

Contribution

The paper introduces a comprehensive dataset, evaluation framework, and benchmark for AI grading of handwritten mathematics, addressing real-world challenges and proposing practical design principles.

Findings

1. AI grading aligns well with TA scores and student feedback.
2. Most AI-generated feedback is rated as correct or acceptable.
3. The study identifies key failure modes and proposes solutions.

Abstract

Grading in large undergraduate STEM courses often yields minimal feedback due to heavy instructional workloads. We present a large-scale empirical study of AI grading on real, handwritten single-variable calculus work from UC Irvine. Using OCR-conditioned large language models with structured, rubric-guided prompting, our system produces scores and formative feedback for thousands of free-response quiz submissions from nearly 800 students. In a setting with no single ground-truth label, we evaluate performance against official teaching-assistant grades, student surveys, and independent human review, finding strong alignment with TA scoring and a large majority of AI-generated feedback rated as correct or acceptable across quizzes. Beyond calculus, this setting highlights core challenges in OCR-conditioned mathematical reasoning and partial-credit assessment. We analyze key failure…
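As a rough illustration of what comparing AI scores with official TA grades can involve, the sketch below computes a few simple agreement statistics over paired scores. It is not the paper's evaluation code: the function name `agreement_summary`, the `tolerance` parameter, the choice of statistics, and the sample scores are all assumptions made for this example, and the metrics the authors actually report may differ.

```python
def agreement_summary(ai_scores, ta_scores, tolerance=1.0):
    """Summarize how closely AI scores track TA scores on the same submissions.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    if len(ai_scores) != len(ta_scores) or not ai_scores:
        raise ValueError("score lists must be non-empty and of equal length")

    n = len(ai_scores)
    # Absolute score gap for each submission.
    diffs = [abs(a - t) for a, t in zip(ai_scores, ta_scores)]

    return {
        # Fraction of submissions where both graders gave the same score.
        "exact_match_rate": sum(d == 0 for d in diffs) / n,
        # Fraction where the scores differ by at most `tolerance` points.
        "within_tolerance_rate": sum(d <= tolerance for d in diffs) / n,
        # Average size of the disagreement, in points.
        "mean_abs_diff": sum(diffs) / n,
    }


# Made-up scores for four submissions (illustrative, not data from the study).
ai = [4, 3, 5, 2]
ta = [4, 4, 5, 0]
summary = agreement_summary(ai, ta)
print(summary)
```

Because the abstract notes that there is no single ground-truth label, statistics like these describe agreement between two graders rather than accuracy, which is presumably why the study also draws on student surveys and independent human review.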

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Teaching and Learning Programming · Academic integrity and plagiarism