Digital Editions as Distant Supervision for Layout Analysis of Printed Books

Alejandro H. Toselli; Si Wu; David A. Smith

arXiv:2112.12703·cs.CV·December 24, 2021

TL;DR

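This paper proposes using the semantic markup in digital editions (such as Text Encoding Initiative and EpiDoc encodings) as distant supervision for training and evaluating layout analysis models for historical printed books. In experiments on the Deutsches Textarchiv (DTA), the region-level evaluation methods correlate highly with pixel-level and word-level metrics.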

Contribution

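It describes methods for exploiting the markup that already exists in digital editions, which records semantic regions (such as notes and figures) and physical features (such as page and line breaks), as a supervision signal for both training and evaluating layout analysis models.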

Findings

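01. On the half-million pages of the DTA, region-level evaluation methods correlate highly with pixel-level and word-level metrics.

02. The paper discusses the ability of models trained on the DTA to generalize to other historical printed books.

03. The paper discusses self-training as a possible way to improve accuracy.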
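The first finding concerns agreement between evaluation metrics computed at different granularities. As a minimal sketch of how such agreement might be checked, the snippet below computes a Pearson correlation between two lists of per-page scores. The choice of Pearson's coefficient, the function name `pearson`, and the score values are all assumptions made for illustration; the summary does not say which correlation statistic the authors used, and the numbers are toy placeholders, not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Toy placeholder scores (NOT from the paper): one value per page.
region_scores = [0.2, 0.4, 0.6, 0.8]  # hypothetical region-level metric
pixel_scores = [0.3, 0.5, 0.7, 0.9]   # hypothetical pixel-level metric

r = pearson(region_scores, pixel_scores)
```

A value of `r` close to 1 would indicate that the cheaper region-level metric ranks pages in much the same way as the pixel-level one.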

Abstract

Archivists, textual scholars, and historians often produce digital editions of historical documents. Using markup schemes such as those of the Text Encoding Initiative and EpiDoc, these digital editions often record documents' semantic regions (such as notes and figures) and physical features (such as page and line breaks) as well as transcribing their textual content. We describe methods for exploiting this semantic markup as distant supervision for training and evaluating layout analysis models. In experiments with several model architectures on the half-million pages of the Deutsches Textarchiv (DTA), we find a high correlation of these region-level evaluation methods with pixel-level and word-level metrics. We discuss the possibilities for improving accuracy with self-training and the ability of models trained on the DTA to generalize to other historical printed books.
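To make the idea of markup as distant supervision concrete, here is a minimal sketch of how region labels might be read off a TEI-encoded edition, page by page. It is not the authors' pipeline: the function name `regions_per_page`, the set of tags treated as regions, and the sample document are assumptions for illustration. It also makes the simplifying assumption that a region belongs to the page on which it starts, with `<pb/>` milestones marking page boundaries.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: these TEI elements are the ones treated as layout regions.
REGION_TAGS = {"head", "p", "note", "figure"}


def regions_per_page(tei_xml):
    """Count region elements per page, using <pb/> milestones as page starts."""
    root = ET.fromstring(tei_xml)
    pages = {}
    current = None
    # iter() walks elements in document order, so each region is assigned
    # to the most recent page break seen before it.
    for elem in root.iter():
        tag = elem.tag.split("}")[-1]  # drop the XML namespace prefix
        if tag == "pb":
            current = elem.get("n", str(len(pages) + 1))
            pages.setdefault(current, Counter())
        elif tag in REGION_TAGS and current is not None:
            pages[current][tag] += 1
    return pages


# Hypothetical two-page sample in the TEI namespace.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<pb n="1"/><head>Title</head><p>First<lb/>paragraph</p>
<note place="foot">A note</note>
<pb n="2"/><p>Second</p><figure/><p>Third</p>
</body></text></TEI>"""

page_labels = regions_per_page(SAMPLE)
```

Per-page counts like these carry no coordinates, which is why the supervision is "distant": they can constrain or score a layout model's predictions for a page image without providing pixel-level ground truth.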

Code & Models

Repositories

nulabtmn/printedbooklayout (official)
