SketchJudge: A Diagnostic Benchmark for Grading Hand-drawn Diagrams with Multimodal Large Language Models

Yuhang Su; Mei Wang; Yaoyao Zhong; Guozhang Li; Shixing Li; Yihan Feng; Hua Huang

arXiv:2601.06944 · cs.CV · January 13, 2026

Open Access

TL;DR

This paper introduces SketchJudge, a benchmark for evaluating multimodal large language models' ability to grade and diagnose errors in hand-drawn STEM diagrams, revealing current models' limitations in complex, noisy visual reasoning tasks.

Contribution

The paper presents SketchJudge, a benchmark of 1,015 hand-drawn student responses across geometry, physics, charts, and flowcharts, designed to assess MLLMs' diagnostic and grading capabilities in STEM education contexts.

Findings

01. MLLMs perform significantly worse than humans on SketchJudge.

02. The benchmark exposes weaknesses in vision-language alignment for symbolic, noisy sketches.

03. Current models struggle with structural and semantic reasoning in hand-drawn diagrams.

Abstract

While Multimodal Large Language Models (MLLMs) have achieved remarkable progress in visual understanding, they often struggle when faced with the unstructured and ambiguous nature of human-generated sketches. This limitation is particularly pronounced in the underexplored task of visual grading, where models should not only solve a problem but also diagnose errors in hand-drawn diagrams. Such diagnostic capabilities depend on complex structural, semantic, and metacognitive reasoning. To bridge this gap, we introduce SketchJudge, a novel benchmark tailored for evaluating MLLMs as graders of hand-drawn STEM diagrams. SketchJudge encompasses 1,015 hand-drawn student responses across four domains: geometry, physics, charts, and flowcharts, featuring diverse stylistic variations and distinct error types. Evaluations on SketchJudge demonstrate that even advanced MLLMs lag significantly behind…

Taxonomy

Topics: Multimodal Machine Learning Applications · Data Visualization and Analytics · Computational and Text Analysis Methods