When VLMs 'Fix' Students: Identifying and Penalizing Over-Correction in the Evaluation of Multi-line Handwritten Math OCR

Jin Seong; Wencke Liermann; Minho Kim; Jong-hun Shin; Soojong Lim

arXiv:2604.22774·cs.CY·April 28, 2026

TL;DR

This paper identifies over-correction by Vision-Language Models (VLMs) in multi-line handwritten math OCR and introduces PINK, an LLM-based semantic metric that explicitly penalizes it and aligns better with human judgment than BLEU.

Contribution

The paper presents the first systematic study of multi-line handwritten math OCR, identifies over-correction as a critical failure mode of VLMs, and proposes PINK as a metric that accounts for it.

Findings

01 PINK aligns better with human judgment than BLEU.

02 Models like GPT-4o are heavily penalized for over-correction.

03 Gemini 2.5 Flash is the most faithful transcriber among evaluated models.

Abstract

Accurate transcription of handwritten mathematics is crucial for educational AI systems, yet current benchmarks fail to evaluate this capability properly. Most prior studies focus on single-line expressions and rely on lexical metrics such as BLEU, which fail to assess the semantic reasoning across multi-line student solutions. In this paper, we present the first systematic study of multi-line handwritten math Optical Character Recognition (OCR), revealing a critical failure mode of Vision-Language Models (VLMs): over-correction. Instead of faithfully transcribing a student's work, these models often "fix" errors, thereby hiding the very mistakes an educational assessment aims to detect. To address this, we propose PINK (Penalized INK-based score), a semantic evaluation metric that leverages a Large Language Model (LLM) for rubric-based grading and explicitly penalizes over-correction.…
