DocEmul: a Toolkit to Generate Structured Historical Documents

Samuele Capobianco; Simone Marinai

arXiv:1710.03474·cs.CV·October 11, 2017

TL;DR

This paper introduces DocEmul, a toolkit for generating synthetic structured handwritten documents to augment training data for document analysis tasks, demonstrated by improving record counting accuracy.

Contribution

The paper presents a novel toolkit that creates realistic synthetic handwritten documents with variable structures and noise for training deep learning models.

Findings

01. Synthetic data improved record counting accuracy.

02. The toolkit effectively mimics real document variability.

03. Enhanced training datasets led to better model performance.

Abstract

We propose a toolkit to generate structured synthetic documents emulating the actual document production process. Synthetic documents can be used to train systems to perform document analysis tasks. In our case, we address the record counting task on handwritten structured collections containing a limited number of examples. Using the DocEmul toolkit, we can generate a larger dataset to train a deep architecture to predict the number of records on each page. The toolkit can generate synthetic collections and also perform data augmentation to create a larger training dataset. It includes a method to extract the page background from real pages; this background serves as a substrate on which records are written according to variable structures using cursive fonts. Moreover, it is possible to extend the synthetic collection by adding random noise, page rotations, and other visual…
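The generation idea described in the abstract can be illustrated with a toy sketch. This is not the DocEmul API: every function name and parameter below is hypothetical, pages are plain lists of grayscale values rather than real images, dark bands stand in for handwritten records, and only noise (not rotation or background extraction) is shown. The point is only that each synthetic page comes with its record count as a free label.

```python
import random

def make_page(height, width, n_records, record_height=3, gap=2,
              margin=2, background=255, ink=0):
    """Build a blank page and draw n_records dark bands as stand-in records."""
    page = [[background] * width for _ in range(height)]
    y = margin
    for _ in range(n_records):
        if y + record_height > height - margin:
            raise ValueError("page too small for requested number of records")
        for row in range(y, y + record_height):
            for col in range(margin, width - margin):
                page[row][col] = ink
        y += record_height + gap
    return page

def add_salt_and_pepper(page, fraction, rng):
    """Return a copy where each pixel is reset to black or white with probability `fraction`."""
    noisy = [row[:] for row in page]
    for row in noisy:
        for col in range(len(row)):
            if rng.random() < fraction:
                row[col] = rng.choice((0, 255))
    return noisy

def generate_dataset(n_pages, max_records, seed=0):
    """Produce (page, record_count) pairs; the count is the training label."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(n_pages):
        n = rng.randint(1, max_records)
        page = make_page(height=64, width=48, n_records=n)
        page = add_salt_and_pepper(page, fraction=0.02, rng=rng)
        dataset.append((page, n))
    return dataset

if __name__ == "__main__":
    for page, count in generate_dataset(n_pages=3, max_records=10):
        print(len(page), "rows,", count, "records")
```

With the default sizes in this sketch, a 64-row page fits at most 12 records, so `max_records` should stay at or below that value.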

Code & Models

Repositories

scstech85/DocEmul (official)
