TL;DR
This paper uses the semantic markup in existing digital editions (for example, TEI-encoded notes, figures, and page breaks) as distant supervision for training and evaluating layout analysis models on historical printed books. Experiments cover several model architectures on the half-million pages of the Deutsches Textarchiv (DTA).
Contribution
It describes methods for turning the markup already present in digital editions into supervision signals for layout analysis, covering both model training and region-level evaluation.
Findings
Region-level evaluation methods correlate highly with pixel-level and word-level metrics.
The paper discusses how well models trained on the DTA generalize to other historical printed books.
It also discusses self-training as a possible route to higher accuracy.
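To make the correlation finding concrete, here is a minimal sketch of how agreement between two evaluation metrics could be measured across models. The summary does not say which correlation statistic the paper uses, so Pearson's r is an assumption here, and the scores in the usage lines are made-up placeholders, not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and squares (the 1/n factors cancel in r).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical placeholder scores for four models (NOT from the paper):
# one region-level score and one pixel-level score per model.
region_scores = [0.2, 0.4, 0.6, 0.8]
pixel_scores = [0.3, 0.5, 0.7, 0.9]
r = pearson(region_scores, pixel_scores)
```

A value of r close to 1 would mean the cheaper region-level evaluation ranks models in much the same way as the pixel-level one.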
Abstract
Archivists, textual scholars, and historians often produce digital editions of historical documents. Using markup schemes such as those of the Text Encoding Initiative and EpiDoc, these digital editions often record documents' semantic regions (such as notes and figures) and physical features (such as page and line breaks) as well as transcribing their textual content. We describe methods for exploiting this semantic markup as distant supervision for training and evaluating layout analysis models. In experiments with several model architectures on the half-million pages of the Deutsches Textarchiv (DTA), we find a high correlation of these region-level evaluation methods with pixel-level and word-level metrics. We discuss the possibilities for improving accuracy with self-training and the ability of models trained on the DTA to generalize to other historical printed books.
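As a rough illustration of how semantic markup can serve as supervision, the sketch below reads a small TEI fragment and groups region labels by page, using page-break (`pb`) elements as boundaries. The sample document, the choice of region tags, and the function name `regions_by_page` are all assumptions made for illustration; this is not the paper's pipeline, which must also align such regions with page images.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical TEI fragment, invented for this example.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <pb n="1"/>
    <head>Erstes Kapitel</head>
    <p>Body text on page one.<note place="foot">A footnote.</note></p>
    <pb n="2"/>
    <p>Body text on page two.</p>
    <figure><head>A caption.</head></figure>
  </body></text>
</TEI>"""

# Assumed set of elements to treat as layout regions.
REGION_TAGS = {"head", "p", "note", "figure"}


def regions_by_page(tei_xml):
    """Return {page number: [(region tag, text), ...]} in document order."""
    root = ET.fromstring(tei_xml)
    pages = defaultdict(list)
    current_page = None
    for elem in root.iter():  # pre-order traversal, i.e. document order
        tag = elem.tag.split("}", 1)[-1]  # drop the "{namespace}" prefix
        if tag == "pb":
            current_page = elem.get("n")
        elif tag in REGION_TAGS and current_page is not None:
            # Collapse whitespace; text of nested regions is included.
            text = " ".join("".join(elem.itertext()).split())
            pages[current_page].append((tag, text))
    return dict(pages)


pages = regions_by_page(SAMPLE_TEI)
```

Each region is assigned to the page on which it starts, so a paragraph that runs across a page break would need extra handling. Nested regions such as a footnote inside a paragraph appear both on their own and inside the parent's text.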
