Document Layout Annotation: Database and Benchmark in the Domain of Public Affairs
Alejandro Peña, Aythami Morales, Julian Fierrez, Javier Ortega-Garcia, Marcos Grande, Iñigo Puente, Jorge Cordova, Gonzalo Cordova

TL;DR
This paper introduces a new annotated database for Document Layout Analysis (DLA) in the public affairs domain. The database was built by applying a semi-automatic labeling procedure to Spanish government documents, and it is intended to support research and development in document understanding.
Contribution
The work presents a novel, large-scale annotated dataset for DLA in public documents, along with a semi-automatic labeling method validated with high accuracy.
Findings
The dataset contains 37.9K documents and 8M labels.
The labeling procedure achieves up to 99% accuracy.
The dataset supports advanced research in document layout understanding.
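
As an illustration of the accuracy finding above, the following is a minimal sketch of how automatically assigned layout labels could be scored against a manually validated sample. It assumes that accuracy means the fraction of automatic labels that agree with a human reference; the summary does not state how the figure was computed. The function names and the example label strings ("text", "table", "image") are hypothetical and are not taken from the paper.

```python
from collections import Counter


def label_accuracy(auto_labels, reference_labels):
    """Fraction of automatically assigned labels that match the reference labels."""
    if len(auto_labels) != len(reference_labels):
        raise ValueError("label sequences must have the same length")
    if not auto_labels:
        raise ValueError("no labels to compare")
    matches = sum(a == r for a, r in zip(auto_labels, reference_labels))
    return matches / len(auto_labels)


def per_class_accuracy(auto_labels, reference_labels):
    """Accuracy broken down by reference label (i.e. per-class recall)."""
    totals = Counter(reference_labels)
    correct = Counter(r for a, r in zip(auto_labels, reference_labels) if a == r)
    return {label: correct[label] / totals[label] for label in totals}


# Hypothetical labels for four layout components on one page.
auto = ["text", "table", "image", "text"]
reference = ["text", "table", "text", "text"]

overall = label_accuracy(auto, reference)
by_class = per_class_accuracy(auto, reference)
```

A per-class breakdown is included because an overall figure can hide weak categories when one label (typically body text) dominates the page.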
Abstract
Every day, thousands of digital documents are generated with useful information for companies, public organizations, and citizens. Given the impossibility of processing them manually, the automatic processing of these documents is becoming increasingly necessary in certain sectors. However, this task remains challenging, since in most cases a text-only based parsing is not enough to fully understand the information presented through different components of varying significance. In this regard, Document Layout Analysis (DLA) has been an interesting research field for many years, which aims to detect and classify the basic components of a document. In this work, we used a procedure to semi-automatically annotate digital documents with different layout labels, including 4 basic layout blocks and 4 text categories. We apply this procedure to collect a novel database for DLA in the public…
Taxonomy
Topics: Digital Media Forensic Detection · Handwritten Text Recognition Techniques · Web Data Mining and Analysis
Methods: Document Layout Analysis
