PubLayNet: largest dataset ever for document layout analysis

Xu Zhong; Jianbin Tang; Antonio Jimeno Yepes

arXiv:1908.07836·cs.CL·August 22, 2019·39 cites

TL;DR

PubLayNet is a large-scale dataset of over 360,000 document images with annotated layout elements, enabling more effective training of neural networks for document layout analysis, especially in scientific articles.

Contribution

The paper introduces PubLayNet, the largest publicly available dataset for document layout analysis, facilitating improved deep learning models for scientific document understanding.

Findings

01

Deep neural networks trained on PubLayNet accurately recognize scientific article layouts.

02

Models pre-trained on PubLayNet are effective base models for transfer learning to other document domains.

03

The dataset enables development of more advanced document layout analysis models.

Abstract

Recognizing the layout of unstructured digital documents is an important step when parsing the documents into a structured machine-readable format for downstream applications. Deep neural networks developed for computer vision have proven to be an effective method for analyzing the layout of document images. However, the document layout datasets that are currently publicly available are several orders of magnitude smaller than established computer vision datasets. Models have to be trained by transfer learning from a base model that is pre-trained on a traditional computer vision dataset. In this paper, we develop the PubLayNet dataset for document layout analysis by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central. The size of the dataset is comparable to established computer vision datasets, containing…
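The matching idea mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' pipeline: the XML tags (`title`, `p`), the `PDF_BOXES` list, the `match_layout` helper, and the 0.8 similarity threshold are all hypothetical, chosen only to show how text from an XML representation might be paired with text boxes extracted from a PDF page to produce labeled bounding boxes.

```python
from difflib import SequenceMatcher
import xml.etree.ElementTree as ET

# Hypothetical XML representation of an article (tags are illustrative only).
XML_SNIPPET = """<article>
  <title>Layout analysis at scale</title>
  <p>Deep networks need large annotated datasets.</p>
</article>"""

# Hypothetical text boxes extracted from a PDF page: (text, (x0, y0, x1, y1)).
# The second box contains a hyphenation artifact, so exact matching would fail.
PDF_BOXES = [
    ("Layout analysis at scale", (72, 60, 540, 90)),
    ("Deep networks need large anno-tated datasets.", (72, 110, 540, 150)),
    ("Page 1", (280, 760, 330, 775)),
]


def normalize(text):
    """Lowercase and collapse whitespace so minor formatting differences are ignored."""
    return " ".join(text.lower().split())


def match_layout(xml_string, pdf_boxes, threshold=0.8):
    """Pair each XML element with its most similar PDF text box.

    Returns one annotation per matched element; elements whose best
    similarity falls below the (assumed) threshold are left unannotated.
    """
    root = ET.fromstring(xml_string)
    annotations = []
    for node in root.iter():
        if node.tag not in ("title", "p"):
            continue
        target = normalize(node.text or "")
        best_score, best_bbox = 0.0, None
        for text, bbox in pdf_boxes:
            score = SequenceMatcher(None, target, normalize(text)).ratio()
            if score > best_score:
                best_score, best_bbox = score, bbox
        if best_score >= threshold:
            annotations.append({"category": node.tag, "bbox": best_bbox})
    return annotations


annotations = match_layout(XML_SNIPPET, PDF_BOXES)
for annotation in annotations:
    print(annotation)
```

In this sketch the page-number box is left unmatched because no XML element resembles it, which hints at why fuzzy rather than exact matching is useful when PDF text extraction introduces small differences.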

Taxonomy

Topics: Handwritten Text Recognition Techniques · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques