Can Vision-Language Models Evaluate Handwritten Math?
Oikantik Nath, Hanani Bathina, Mohammed Safi Ur Rahman Khan, Mitesh M. Khapra

TL;DR
This paper introduces FERMAT, a benchmark for evaluating vision-language models on handwritten math, and shows that current models are limited both in reasoning over handwritten content and in correcting its errors.
Contribution
FERMAT is a new benchmark that assesses VLMs' ability to detect, localize, and correct errors in handwritten math, filling a critical research gap.
Findings
Current VLMs struggle with handwritten math reasoning.
Gemini-1.5-Pro achieves a 77% error correction rate.
Models perform better with printed or image-based inputs.
Abstract
Recent advancements in Vision-Language Models (VLMs) have opened new possibilities in automatic grading of handwritten student responses, particularly in mathematics. However, a comprehensive study to test the ability of VLMs to evaluate and reason over handwritten content remains absent. To address this gap, we introduce FERMAT, a benchmark designed to assess the ability of VLMs to detect, localize and correct errors in handwritten mathematical content. FERMAT spans four key error dimensions - computational, conceptual, notational, and presentation - and comprises over 2,200 handwritten math solutions derived from 609 manually curated problems from grades 7-12 with intentionally introduced perturbations. Using FERMAT we benchmark nine VLMs across three tasks: error detection, localization, and correction. Our results reveal significant shortcomings in current VLMs in reasoning over…
