The Aftermath of DrawEduMath: Vision Language Models Underperform with Struggling Students and Misdiagnose Errors

Li Lucy; Albert Zhang; Nathan Anderson; Ryan Knight; Kyle Lo

arXiv:2603.00925·cs.CL·March 3, 2026

TL;DR

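This study evaluates 11 vision-language models (VLMs) on DrawEduMath, a QA benchmark built from real students' handwritten, hand-drawn responses to math problems. The models perform worst when describing work from students who need more pedagogical help and when answering questions about student error, which suggests that educational use cases call for different development incentives than math problem solving.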

Contribution

The paper provides an extensive, year-long snapshot of how 11 VLMs perform on DrawEduMath, showing that their weaknesses concentrate on assessing student error and on describing work from students who need more pedagogical help.

Findings

01. All models underperform on student work requiring pedagogical help.

02. Models struggle most with questions about student error assessment.

03. VLMs need alternative development incentives for educational applications.

Abstract

Effective mathematics education requires identifying and responding to students' mistakes. For AI to support pedagogical applications, models must perform well across different levels of student proficiency. Our work provides an extensive, year-long snapshot of how 11 vision-language models (VLMs) perform on DrawEduMath, a QA benchmark involving real students' handwritten, hand-drawn responses to math problems. We find that models' weaknesses concentrate on a core component of math education: student error. All evaluated VLMs underperform when describing work from students who require more pedagogical help, and across all QA, they struggle the most on questions related to assessing student error. Thus, while VLMs may be optimized to be math problem solving experts, our results suggest that they require alternative development incentives to adequately support educational use cases.
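
The two comparisons described in the abstract (performance by student proficiency and performance by question type) amount to slicing per-question correctness along different columns. The sketch below illustrates that kind of breakdown only; the records, group labels, question-type labels, and the `accuracy_by` helper are all hypothetical and are not taken from DrawEduMath or from the authors' evaluation code.

```python
from collections import defaultdict

# Hypothetical records: (model, student_group, question_type, is_correct).
# All names and values are illustrative assumptions, not DrawEduMath data.
records = [
    ("model_a", "needs_more_help", "error_assessment", False),
    ("model_a", "needs_more_help", "content_description", True),
    ("model_a", "needs_less_help", "error_assessment", True),
    ("model_a", "needs_less_help", "content_description", True),
    ("model_b", "needs_more_help", "error_assessment", False),
    ("model_b", "needs_less_help", "error_assessment", False),
]


def accuracy_by(records, key_index):
    """Return {(model, slice_value): accuracy} for the chosen slice column."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        key = (rec[0], rec[key_index])
        totals[key] += 1
        correct[key] += int(rec[3])
    return {key: correct[key] / totals[key] for key in totals}


# Slice by student group (column 1) and by question type (column 2).
by_group = accuracy_by(records, 1)
by_question = accuracy_by(records, 2)

for (model, group), acc in sorted(by_group.items()):
    print(f"{model} {group}: {acc:.2f}")
```

Comparing `by_group` entries for the same model would show whether accuracy drops on work from students who need more help, and `by_question` would show which question types are hardest; how the paper actually scores answers may differ from this simple correct/incorrect flag.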

Taxonomy

Topics: Teaching and Learning Programming · Intelligent Tutoring Systems and Adaptive Learning · Explainable Artificial Intelligence (XAI)