TL;DR
This paper develops an open-source OCR pipeline for transcribing medieval Latin legal manuscripts, reaching up to 88% word accuracy and making the records more accessible to scholars and students.
Contribution
It introduces a neural network-based OCR pipeline trained on a new dataset of 4029 lines of text from 193 medieval cases, with post-processing (an n-gram language model and LLM-based correction) that raises word accuracy from 79% to 88%.
Findings
Neural network models achieved 79% word accuracy on medieval Latin texts.
Adding an n-gram language model increased word accuracy to 82%.
Correcting the output with Gemini Pro 3 raised word accuracy to 88%.
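The figures above are word accuracies. The summary does not say exactly how the metric is computed, so the following is only a minimal sketch of one common definition: one minus the word-level edit distance divided by the number of reference words. The function names and the sample Latin line are illustrative assumptions, not taken from the paper.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_accuracy(reference, hypothesis):
    """Assumed definition: 1 - (word edit distance / reference length)."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        return 1.0 if not hyp_words else 0.0
    errors = word_edit_distance(ref_words, hyp_words)
    return max(0.0, 1.0 - errors / len(ref_words))


# Hypothetical example line: one of four words is transcribed differently.
print(word_accuracy("rex mandavit vicecomiti quod",
                    "rex mandauit vicecomiti quod"))
```

Under this definition, a single wrong word in a four-word line costs 25 percentage points, which is why corrections to a few frequent abbreviations can move the overall score noticeably.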
Abstract
The record of the beginning of the most widespread legal system in the world is contained in millions of pages of handwritten text. Most of the records of the first centuries of the Anglo-American legal system are handwritten in a highly abbreviated form of medieval Latin which only a few dozen scholars in the world are trained to read. In this interdisciplinary project, we construct a dataset of 4029 lines of text across 193 medieval criminal and civil cases. We then use the dataset to train an open-source end-to-end pipeline for transcribing these manuscripts. We first train standard neural network architectures for line segmentation and handwriting recognition (R-Blla and CNN+LSTM with CTC decoding, respectively) and show that they can already achieve 79% word accuracy, despite the relatively small training set and the challenge of expanding abbreviations. We then demonstrate that…
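The abstract mentions CTC decoding for the handwriting recognizer but does not say which decoder is used. As a rough illustration only, here is the simplest variant, greedy (best-path) decoding: take the top-scoring symbol in each frame, merge consecutive repeats, then drop the blank symbol. The function name, the toy alphabet, and the convention that index 0 is the blank are assumptions made for this sketch; the paper's pipeline may well use beam search or a different symbol layout.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Greedy CTC decoding.

    frame_scores: one list of per-symbol scores for each time step.
    alphabet: maps symbol index to character; alphabet[blank] is unused.
    """
    # Best-scoring symbol index for every frame.
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]

    decoded = []
    previous = None
    for index in best_path:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank; blanks separate genuine double letters.
        if index != previous and index != blank:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)


# Toy alphabet (assumed): index 0 is the CTC blank.
alphabet = ["", "r", "e", "x"]
frames = [
    [0.1, 0.8, 0.05, 0.05],  # r
    [0.1, 0.7, 0.1, 0.1],    # r (repeat, merged)
    [0.9, 0.05, 0.03, 0.02], # blank
    [0.1, 0.1, 0.7, 0.1],    # e
    [0.1, 0.1, 0.1, 0.7],    # x
]
print(ctc_greedy_decode(frames, alphabet))
```

Greedy decoding ignores language context entirely, which is one reason a language model applied afterwards, as in the n-gram step reported above, can recover additional words.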
