DELINE8K: A Synthetic Data Pipeline for the Semantic Segmentation of Historical Documents
Taylor Archibald, Tony Martinez

TL;DR
This paper introduces DELINE8K, a comprehensive synthetic dataset for semantic segmentation of historical documents, addressing limitations of existing datasets and improving performance on the NAFSS benchmark.
Contribution
We developed DELINE8K using the most comprehensive document semantic segmentation synthesis pipeline to date, combining preprinted text, handwriting, and document backgrounds from over 10 sources to improve model training.
Findings
DELINE8K outperforms previous datasets on the NAFSS benchmark.
Synthetic data improves segmentation accuracy for historical documents.
The dataset is publicly available for further research.
Abstract
Document semantic segmentation is a promising avenue that can facilitate document analysis tasks, including optical character recognition (OCR), form classification, and document editing. Although several synthetic datasets have been developed to distinguish handwriting from printed text, they fall short in class variety and document diversity. We demonstrate the limitations of training on existing datasets when solving the National Archives Form Semantic Segmentation dataset (NAFSS), a dataset which we introduce. To address these limitations, we propose the most comprehensive document semantic segmentation synthesis pipeline to date, incorporating preprinted text, handwriting, and document backgrounds from over 10 sources to create the Document Element Layer INtegration Ensemble 8K, or DELINE8K dataset. Our customized dataset exhibits superior performance on the NAFSS benchmark,…
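The layer-integration idea in the abstract can be illustrated with a minimal sketch. It assumes grayscale layers where 0 is full ink and 255 is blank, blends them by keeping the darkest pixel, and derives one mask per class from a fixed ink threshold. The function name `composite`, the constant `INK_THRESHOLD`, the blending rule, and the two class names are illustrative assumptions, not the authors' implementation.

```python
INK_THRESHOLD = 128  # assumed cutoff: pixels darker than this count as ink


def composite(background, printed, handwriting):
    """Overlay two ink layers on a background by keeping the darkest pixel.

    All inputs are equally sized lists of rows of grayscale values (0-255).
    Returns the blended image and one boolean mask per class.
    """
    height, width = len(background), len(background[0])
    image = [[0] * width for _ in range(height)]
    # One mask per class; a pixel may belong to both when strokes overlap.
    masks = {
        "printed": [[False] * width for _ in range(height)],
        "handwriting": [[False] * width for _ in range(height)],
    }
    for y in range(height):
        for x in range(width):
            p, h = printed[y][x], handwriting[y][x]
            image[y][x] = min(background[y][x], p, h)
            masks["printed"][y][x] = p < INK_THRESHOLD
            masks["handwriting"][y][x] = h < INK_THRESHOLD
    return image, masks


# Tiny 2x3 example: printed ink in column 0, handwriting crossing it in row 1.
background = [[230, 230, 230], [230, 230, 230]]
printed = [[0, 255, 255], [0, 255, 255]]
handwriting = [[255, 255, 40], [40, 255, 255]]
image, masks = composite(background, printed, handwriting)
```

Because each source layer is known before blending, the per-class masks come for free, including pixels where printed text and handwriting overlap. That is the kind of ground truth that is expensive to annotate by hand on real scans.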
Taxonomy
Topics: Image Processing and 3D Reconstruction
