SketchJudge: A Diagnostic Benchmark for Grading Hand-drawn Diagrams with Multimodal Large Language Models
Yuhang Su, Mei Wang, Yaoyao Zhong, Guozhang Li, Shixing Li, Yihan Feng, Hua Huang

TL;DR
This paper introduces SketchJudge, a benchmark for evaluating multimodal large language models' ability to grade and diagnose errors in hand-drawn STEM diagrams, revealing current models' limitations in complex, noisy visual reasoning tasks.
Contribution
The paper presents SketchJudge, a benchmark of 1,015 hand-drawn student responses spanning geometry, physics, charts, and flowcharts, built to assess MLLMs' grading and error-diagnosis capabilities in STEM education contexts.
Findings
MLLMs perform significantly worse than humans on SketchJudge.
The benchmark exposes weaknesses in vision-language alignment for symbolic, noisy sketches.
Current models struggle with structural and semantic reasoning in hand-drawn diagrams.
Abstract
While Multimodal Large Language Models (MLLMs) have achieved remarkable progress in visual understanding, they often struggle when faced with the unstructured and ambiguous nature of human-generated sketches. This limitation is particularly pronounced in the underexplored task of visual grading, where models should not only solve a problem but also diagnose errors in hand-drawn diagrams. Such diagnostic capabilities depend on complex structural, semantic, and metacognitive reasoning. To bridge this gap, we introduce SketchJudge, a novel benchmark tailored for evaluating MLLMs as graders of hand-drawn STEM diagrams. SketchJudge encompasses 1,015 hand-drawn student responses across four domains: geometry, physics, charts, and flowcharts, featuring diverse stylistic variations and distinct error types. Evaluations on SketchJudge demonstrate that even advanced MLLMs lag significantly behind…
Taxonomy
Topics: Multimodal Machine Learning Applications · Data Visualization and Analytics · Computational and Text Analysis Methods
