When VLMs 'Fix' Students: Identifying and Penalizing Over-Correction in the Evaluation of Multi-line Handwritten Math OCR
Jin Seong, Wencke Liermann, Minho Kim, Jong-hun Shin, Soojong Lim

TL;DR
This paper identifies over-correction by Vision-Language Models (VLMs) in multi-line handwritten math OCR and introduces PINK, a semantic metric that uses LLM rubric-based grading and explicitly penalizes over-correction.
Contribution
The paper presents the first systematic study of multi-line handwritten math OCR, identifies over-correction as a critical failure mode of VLMs, and proposes PINK as a semantic alternative to lexical metrics such as BLEU.
Findings
PINK aligns better with human judgment than BLEU.
Models like GPT-4o are heavily penalized for over-correction.
Gemini 2.5 Flash is the most faithful transcriber among evaluated models.
Abstract
Accurate transcription of handwritten mathematics is crucial for educational AI systems, yet current benchmarks fail to evaluate this capability properly. Most prior studies focus on single-line expressions and rely on lexical metrics such as BLEU, which fail to assess the semantic reasoning across multi-line student solutions. In this paper, we present the first systematic study of multi-line handwritten math Optical Character Recognition (OCR), revealing a critical failure mode of Vision-Language Models (VLMs): over-correction. Instead of faithfully transcribing a student's work, these models often "fix" errors, thereby hiding the very mistakes an educational assessment aims to detect. To address this, we propose PINK (Penalized INK-based score), a semantic evaluation metric that leverages a Large Language Model (LLM) for rubric-based grading and explicitly penalizes over-correction.…
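To make the over-correction problem concrete, the following is a minimal Python sketch. It is not the paper's PINK implementation: PINK uses an LLM for rubric-based grading, whereas this toy replaces that with exact line comparison and a made-up multiplicative penalty. The function names (`unigram_precision`, `over_corrected_lines`, `penalized_score`), the `penalty` value, and the example solution are all hypothetical, chosen only to show why a lexical score can stay high when a model silently fixes a student's mistake.

```python
from collections import Counter


def unigram_precision(reference: str, hypothesis: str) -> float:
    """Clipped unigram precision, a simplified stand-in for BLEU-style lexical overlap."""
    ref_counts = Counter(reference.split())
    hyp_tokens = hypothesis.split()
    if not hyp_tokens:
        return 0.0
    hyp_counts = Counter(hyp_tokens)
    matched = sum(min(n, ref_counts[tok]) for tok, n in hyp_counts.items())
    return matched / len(hyp_tokens)


def over_corrected_lines(student: str, corrected: str, output: str) -> list[int]:
    """Return indices of lines where the student erred and the output shows the fix.

    Assumption: lines are separated by ';' and compared as exact strings.
    The paper delegates this judgment to an LLM; this is only a toy check.
    """
    split = lambda text: [line.strip() for line in text.split(";")]
    triples = zip(split(student), split(corrected), split(output))
    return [i for i, (s, c, o) in enumerate(triples) if s != c and o == c]


def penalized_score(base_score: float, over_corrected: bool, penalty: float = 0.5) -> float:
    """Hypothetical penalty: scale the base score down when over-correction is found."""
    return base_score * (1.0 - penalty) if over_corrected else base_score


# The student makes an arithmetic slip on the last line (x = 4 instead of x = 3).
student = "2 x + 1 = 7 ; 2 x = 6 ; x = 4"
corrected = "2 x + 1 = 7 ; 2 x = 6 ; x = 3"
model_output = "2 x + 1 = 7 ; 2 x = 6 ; x = 3"  # the model "fixed" the student

lexical = unigram_precision(student, model_output)
flagged = over_corrected_lines(student, corrected, model_output)
final = penalized_score(lexical, bool(flagged))
print(f"lexical={lexical:.3f} flagged_lines={flagged} penalized={final:.3f}")
```

In this example 14 of 15 output tokens match the student's writing, so the lexical score stays high even though the one mistake a grader needs to see has been erased. The explicit penalty is what makes that failure visible in the final score.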
