Towards Knowledge Discovery from the Vatican Secret Archives. In Codice Ratio -- Episode 1: Machine Transcription of the Manuscripts

Donatella Firmani, Marco Maiorino, Paolo Merialdo, and Elena Nieddu

arXiv:1803.03200 · cs.DL · June 16, 2020

TL;DR

This paper presents a scalable machine transcription system for medieval manuscripts from the Vatican Secret Archives, combining character segmentation, neural recognition, and language models to assist paleographers in digitizing large volumes efficiently.

Contribution

The paper introduces an original character segmentation approach that requires minimal training effort, enabling scalable transcription of handwritten manuscripts using neural networks and crowdsourced data.

Findings

01 System achieved good transcriptions on Vatican manuscripts

02 Training with data from 120 students proved effective

03 Reduces manual effort for paleographers

Abstract

In Codice Ratio is a research project to study tools and techniques for analyzing the contents of historical documents conserved in the Vatican Secret Archives (VSA). In this paper, we present our efforts to develop a system to support the transcription of medieval manuscripts. The goal is to provide paleographers with a tool that reduces the effort of transcribing large volumes, such as those stored in the VSA, by producing good transcriptions for significant portions of the manuscripts. We propose an original approach based on character segmentation. Our solution is able to deal with the dirty segmentation that inevitably occurs in handwritten documents. We use a convolutional neural network to recognize characters and language models to compose word transcriptions. Our approach requires minimal training effort, making the transcription process more scalable as the production of training…
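
To make the idea of composing word transcriptions from imperfect character segments more concrete, here is a minimal sketch of one generic way it could work. It is not the authors' implementation. It assumes that a segmenter proposes cut points in a word image, that a classifier assigns character probabilities to each candidate segment between two cuts, and that a character bigram language model scores sequences. The names `SEGMENT_SCORES`, `BIGRAM`, `UNSEEN`, and `best_transcription`, and all numbers, are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical classifier output (made-up numbers): for each candidate
# segment (start_cut, end_cut), a map from character to probability.
# Overlapping segments model "dirty" segmentation: the span between cuts
# 0 and 2 may be one wide letter or two narrow ones.
SEGMENT_SCORES = {
    (0, 1): {"i": 0.6, "l": 0.3},
    (1, 2): {"i": 0.5, "n": 0.2},
    (0, 2): {"n": 0.7, "u": 0.2},
    (2, 3): {"o": 0.9},
}

# Hypothetical character bigram language model; "^" marks the word start.
BIGRAM = {
    ("^", "n"): 0.4, ("^", "i"): 0.3, ("^", "l"): 0.1, ("^", "u"): 0.1,
    ("i", "i"): 0.01, ("i", "n"): 0.3, ("l", "i"): 0.3, ("l", "n"): 0.01,
    ("n", "o"): 0.5, ("i", "o"): 0.2, ("u", "o"): 0.1,
}
UNSEEN = 1e-4  # assumed fallback probability for bigrams not in the table


def best_transcription(num_cuts, segment_scores, bigram):
    """Return the highest-scoring character sequence that tiles cuts 0..num_cuts.

    Dynamic programming over states (cut position, last character); each
    step adds the log classifier probability and the log bigram probability.
    Returns None when no chain of segments reaches the final cut.
    """
    best = {(0, "^"): (0.0, "")}  # state -> (log score, text so far)
    for end in range(1, num_cuts + 1):
        for (start, seg_end), chars in segment_scores.items():
            if seg_end != end:
                continue
            # Snapshot the states, since new ones are added while looping.
            for (cut, prev), (score, text) in list(best.items()):
                if cut != start:
                    continue
                for ch, prob in chars.items():
                    lm = bigram.get((prev, ch), UNSEEN)
                    new_score = score + math.log(prob) + math.log(lm)
                    key = (end, ch)
                    if key not in best or new_score > best[key][0]:
                        best[key] = (new_score, text + ch)
    finals = [v for (cut, _), v in best.items() if cut == num_cuts]
    if not finals:
        return None
    return max(finals, key=lambda v: v[0])[1]


print(best_transcription(3, SEGMENT_SCORES, BIGRAM))
```

In this toy lattice the reading that treats cuts 0 to 2 as a single letter scores higher than any reading that splits it in two, so a spurious cut proposed by the segmenter does not have to end up in the transcription. The actual system described in the paper may combine these signals differently.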
