From Handwriting to Structured Data: Benchmarking AI Digitisation of Handwritten Forms
Nicholas Pather, Joshua Fouché, Sitwala Mundia, Karl-Günter Technau, Thokozile Malaba, Alex Welte, Ushma Mehta, Bruce A. Bassett

TL;DR
This paper benchmarks 17 multi-modal large language models on a challenging handwritten medical form dataset, showing that the most recent models achieve around 85% accuracy and that individual models have distinct task-specific strengths.
Contribution
It provides the first comprehensive benchmarking of state-of-the-art multi-modal LLMs on real-world handwritten form digitisation, revealing their capabilities and limitations.
Findings
Latest models reach ~85% accuracy on complex handwritten forms.
Prompt optimisation improves macro metrics by over 60%.
GPT 5.4 excels in noisy date extraction and reliability.
Abstract
Manual digitisation of structured handwritten documents is slow and costly. We benchmark 17 leading frontier multi-modal large language models and open-source models against a very challenging real-world medical form that mixes dates; structured, printed text; handwritten responses; and significant variability. None of the smaller or older models performs well, but the latest Google and OpenAI models reach accuracies around 85% with weighted F1 scores across the discrete or predefined fields, despite the very challenging nature of the responses. Clear task-specific strengths emerge: GPT 5.4 excels in noisy date extraction as well as reliability, with the lowest hallucination rate. Claude Sonnet 4.6 had the best average performance across formatted fields (dates and numerical values), while Gemini 3.1 delivered the best overall performance, with the lowest…
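The abstract reports weighted F1 over discrete fields, and the findings mention macro metrics. As a rough illustration of how those two averages differ, here is a minimal Python sketch. The function names and the toy "yes"/"no" field values are assumptions made for this example and are not taken from the paper's evaluation code.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return a dict mapping each label to its one-vs-rest F1 score."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by how often each class truly occurs."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Hypothetical ground truth and model output for one tick-box field.
truth = ["yes", "yes", "yes", "no"]
predicted = ["yes", "yes", "no", "no"]
macro = macro_f1(truth, predicted)
weighted = weighted_f1(truth, predicted)
```

On imbalanced fields the weighted average is dominated by the common answers, whereas the macro average is pulled down by rare classes that a model gets wrong, so the two numbers are not interchangeable.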
