Towards Knowledge Discovery from the Vatican Secret Archives. In Codice Ratio -- Episode 1: Machine Transcription of the Manuscripts
Donatella Firmani, Marco Maiorino, Paolo Merialdo, and Elena Nieddu

TL;DR
This paper presents a scalable machine transcription system for medieval manuscripts from the Vatican Secret Archives, combining character segmentation, neural character recognition, and language models to help paleographers transcribe large volumes efficiently.
Contribution
It introduces an original character segmentation approach with minimal training effort, enabling scalable transcription of handwritten manuscripts using neural networks and crowdsourced data.
Findings
The system produced good transcriptions for significant portions of the Vatican manuscripts
Training with data from 120 students proved effective
Reduces manual effort for paleographers
Abstract
In Codice Ratio is a research project to study tools and techniques for analyzing the contents of historical documents conserved in the Vatican Secret Archives (VSA). In this paper, we present our efforts to develop a system to support the transcription of medieval manuscripts. The goal is to provide paleographers with a tool that reduces the effort of transcribing large volumes, such as those stored in the VSA, by producing good transcriptions for significant portions of the manuscripts. We propose an original approach based on character segmentation. Our solution is able to deal with the dirty segmentation that inevitably occurs in handwritten documents. We use a convolutional neural network to recognize characters and language models to compose word transcriptions. Our approach requires minimal training effort, making the transcription process more scalable as the production of training…
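To make the pipeline in the abstract concrete, here is a minimal sketch of how per-segment character scores and a language model could be combined into a word transcription. It is an illustration under stated assumptions, not the authors' implementation: the function name `best_transcription`, the representation of segmentation candidates as edges between cut points, the character bigram model, and every probability below are hypothetical. In the real system the character scores would come from the convolutional network; here they are hard-coded.

```python
import math

def best_transcription(n_cuts, edge_scores, bigram_logp, lm_weight=1.0):
    """Pick the most likely word over a lattice of candidate segments.

    n_cuts       -- index of the last cut point (cut points are 0..n_cuts)
    edge_scores  -- {(start, end): {char: prob}}, hypothetical classifier
                    output for the image slice between two cut points
    bigram_logp  -- {(prev_char, char): log prob}, hypothetical language
                    model; "^" marks the start of the word
    lm_weight    -- how strongly the language model influences the result
    """
    unseen = math.log(1e-4)  # assumed penalty for bigrams not in the model
    # best[(cut_point, last_char)] = (log score, text so far)
    best = {(0, "^"): (0.0, "")}
    for end in range(1, n_cuts + 1):
        for (start, e), char_probs in edge_scores.items():
            if e != end:
                continue
            # Snapshot the states: new states are added while iterating.
            for (node, prev), (score, text) in list(best.items()):
                if node != start:
                    continue
                for ch, p in char_probs.items():
                    lm = bigram_logp.get((prev, ch), unseen)
                    cand = score + math.log(p) + lm_weight * lm
                    key = (end, ch)
                    if key not in best or cand > best[key][0]:
                        best[key] = (cand, text + ch)
    finals = [v for (node, _), v in best.items() if node == n_cuts]
    return max(finals, key=lambda v: v[0])[1] if finals else None

# Toy example: three strokes that read either as "in" (two segments)
# or as a single "m" (one merged segment). All numbers are invented.
edges = {
    (0, 1): {"i": 0.6, "l": 0.4},
    (1, 2): {"n": 0.7, "u": 0.3},
    (0, 2): {"m": 0.5},
}
bigrams = {
    ("^", "i"): math.log(0.3),
    ("i", "n"): math.log(0.4),
    ("^", "m"): math.log(0.05),
}

with_lm = best_transcription(2, edges, bigrams)                    # "in"
without_lm = best_transcription(2, edges, bigrams, lm_weight=0.0)  # "m"
print(with_lm, without_lm)
```

In this toy case the classifier alone slightly prefers the merged segment "m", while the language model tips the decision towards "in". This is the kind of ambiguity that imperfect segmentation creates and that a language model can help resolve.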
