docExtractor: An off-the-shelf historical document element extraction

Tom Monnier; Mathieu Aubry

arXiv:2012.08191 · cs.CV · December 16, 2020


TL;DR

docExtractor is a pre-trained system for extracting visual elements, such as text lines and illustrations, from historical documents. It performs well without dataset-specific training, which the authors argue is critical for digital humanities applications.

Contribution

The paper introduces a fast generator of synthetic documents, a fully convolutional network for line-level element extraction, and IlluHisDoc, a new public dataset for evaluating illustration segmentation.
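
To illustrate why synthetic generation removes the need for manual annotation, here is a minimal sketch in plain Python. It is not the authors' generator: the function name `make_synthetic_page`, the label values, and the layout rules are all hypothetical. The point is only that when a page is composed programmatically, the per-pixel labels (background, illustration, text line) are known by construction.

```python
import random

# Hypothetical label values for this sketch (not taken from the paper).
BACKGROUND, ILLUSTRATION, TEXT_LINE = 0, 1, 2


def make_synthetic_page(width=60, height=80, seed=0):
    """Return a height x width label mask for one toy synthetic page."""
    rng = random.Random(seed)
    mask = [[BACKGROUND] * width for _ in range(height)]

    # Place one illustration block near the top of the page.
    ill_h = rng.randint(10, 20)
    ill_w = rng.randint(20, 40)
    ill_x = rng.randint(5, width - ill_w - 5)
    ill_y = 5
    for y in range(ill_y, ill_y + ill_h):
        for x in range(ill_x, ill_x + ill_w):
            mask[y][x] = ILLUSTRATION

    # Fill the rest of the page with text lines of random length.
    # Each line is a thin band, so the mask is labeled at line level.
    y = ill_y + ill_h + 4
    line_h, gap = 2, 2
    while y + line_h <= height - 5:
        line_w = rng.randint(width // 2, width - 10)
        for yy in range(y, y + line_h):
            for x in range(5, 5 + line_w):
                mask[yy][x] = TEXT_LINE
        y += line_h + gap
    return mask


if __name__ == "__main__":
    page = make_synthetic_page(seed=0)
    labels = sorted({v for row in page for v in row})
    print(len(page), len(page[0]), labels)
```

In a real pipeline the same layout decisions would also drive rendering of the page image, so each generated image comes paired with its mask and can be used to train a segmentation network directly.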

Findings

01 High-quality off-the-shelf performance across datasets

02 Comparable results to state-of-the-art when fine-tuned

03 Better generalization than detection-based approaches

Abstract

We present docExtractor, a generic approach for extracting visual elements such as text lines or illustrations from historical documents without requiring any real data annotation. We demonstrate it provides high-quality performances as an off-the-shelf system across a wide variety of datasets and leads to results on par with state-of-the-art when fine-tuned. We argue that the performance obtained without fine-tuning on a specific dataset is critical for applications, in particular in digital humanities, and that the line-level page segmentation we address is the most relevant for a general purpose element extraction engine. We rely on a fast generator of rich synthetic documents and design a fully convolutional network, which we show to generalize better than a detection-based approach. Furthermore, we introduce a new public dataset dubbed IlluHisDoc dedicated to the fine evaluation of…

Code & Models

Repositories

monniert/docExtractor (PyTorch, official)
