Handwriting Recognition in Historical Documents with Multimodal LLM
Lucian Li

TL;DR
This paper evaluates the effectiveness of multimodal large language models like Gemini in transcribing handwritten historical documents, comparing their performance to traditional Transformer-based OCR methods, with implications for cultural preservation.
Contribution
It introduces the application of multimodal LLMs to handwritten OCR, demonstrating their potential to outperform traditional models with minimal training data.
Findings
Multimodal LLMs achieve higher transcription accuracy than traditional methods.
Few-shot prompting enables effective handwriting recognition.
Potential for improved mass digitization of historical documents.
Abstract
An immense quantity of historical and cultural documentation exists only as handwritten manuscripts. At the same time, performing OCR across scripts and handwriting styles has proven far more difficult than digitizing print. While recent Transformer-based models have achieved relatively strong performance, they rely heavily on manually transcribed training data and have difficulty generalizing across writers. Multimodal LLMs, such as GPT-4V and Gemini, have demonstrated effectiveness in performing OCR and computer vision tasks with few-shot prompting. In this paper, I evaluate the accuracy of handwritten document transcriptions generated by Gemini against current state-of-the-art Transformer-based methods.
Keywords: Optical Character Recognition, Multimodal Language Models, Cultural Preservation, Mass digitization,…
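The abstract does not say here which accuracy metric is used. As an illustration only, the sketch below computes character error rate (CER), a metric commonly used for handwriting transcription: the Levenshtein edit distance between a reference transcription and a model output, divided by the reference length. The function names and sample strings are hypothetical and are not taken from the paper.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn hypothesis into reference."""
    # prev[j] holds the distance between the processed prefix of reference
    # and the first j characters of hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,                      # character missing from hypothesis
                curr[j - 1] + 1,                  # extra character in hypothesis
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length (assumed metric, see lead-in)."""
    if not reference:
        raise ValueError("reference transcription must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical ground truth and model output for one manuscript line.
    truth = "abcd"
    output = "abxd"
    print(character_error_rate(truth, output))
```

Lower is better: a CER of 0 means the output matches the reference exactly, and the value can exceed 1 when the output contains many spurious characters.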
Taxonomy
Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Label Smoothing · Layer Normalization · Residual Connection · Position-Wise Feed-Forward Layer · Adam · Multi-Head Attention · Softmax
