DELINE8K: A Synthetic Data Pipeline for the Semantic Segmentation of Historical Documents

Taylor Archibald; Tony Martinez

arXiv:2404.19259 · cs.CV · May 1, 2024 · 1 citation

Open Access

TL;DR

This paper introduces DELINE8K, a comprehensive synthetic dataset for semantic segmentation of historical documents, addressing limitations of existing datasets and improving performance on the NAFSS benchmark.

Contribution

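We developed DELINE8K, a synthetic dataset for document semantic segmentation. It is produced by a synthesis pipeline that combines preprinted text, handwriting, and document backgrounds from over 10 sources, which we describe as the most comprehensive such pipeline to date.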

Findings

01

Models trained on DELINE8K outperform models trained on previous datasets on the NAFSS benchmark.

02

Synthetic data improves segmentation accuracy for historical documents.

03

The dataset is publicly available for further research.

Abstract

Document semantic segmentation is a promising avenue that can facilitate document analysis tasks, including optical character recognition (OCR), form classification, and document editing. Although several synthetic datasets have been developed to distinguish handwriting from printed text, they fall short in class variety and document diversity. We demonstrate the limitations of training on existing datasets when solving the National Archives Form Semantic Segmentation dataset (NAFSS), a dataset which we introduce. To address these limitations, we propose the most comprehensive document semantic segmentation synthesis pipeline to date, incorporating preprinted text, handwriting, and document backgrounds from over 10 sources to create the Document Element Layer INtegration Ensemble 8K, or DELINE8K dataset. Our customized dataset exhibits superior performance on the NAFSS benchmark,…
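
The layer-integration idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `composite_layers`, the class names, the multiply blend, and the ink threshold are all assumptions chosen for illustration. The sketch overlays per-class "ink" layers on a background and keeps one binary mask per class, so a pixel covered by both handwriting and preprinted text is labeled with both.

```python
import numpy as np

# Hypothetical class order for the label channels (an assumption, not from the paper).
CLASSES = ("handwriting", "preprinted_text")


def composite_layers(background, ink_layers, threshold=0.5):
    """Overlay ink layers on a background and build per-class masks.

    background: HxW float array in [0, 1], where 1.0 is white paper.
    ink_layers: dict mapping a class name to an HxW float array in [0, 1],
                where 0.0 means no ink and 1.0 means fully opaque ink.
    Returns (image, masks) with masks of shape (len(CLASSES), H, W), dtype bool.
    """
    image = background.astype(np.float64).copy()
    masks = []
    for name in CLASSES:
        # A class with no layer contributes no ink and an empty mask.
        ink = np.clip(ink_layers.get(name, np.zeros_like(image)), 0.0, 1.0)
        # Multiply blend: ink darkens whatever is already on the page.
        image = image * (1.0 - ink)
        # Label a pixel with this class wherever its ink is strong enough.
        masks.append(ink > threshold)
    return image, np.stack(masks)
```

Because each class gets its own channel, overlapping elements stay separable in the labels even though they merge in the rendered image. A real pipeline would also need to handle color, degradation, and layout, which this sketch omits.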

Taxonomy

Topics: Image Processing and 3D Reconstruction