M$^{6}$Doc: A Large-Scale Multi-Format, Multi-Type, Multi-Layout, Multi-Language, Multi-Annotation Category Dataset for Modern Document Layout Analysis

Hiuyi Cheng, Peirong Zhang, Sihang Wu, Jiaxin Zhang, Qiyuan Zhu, Zecheng Xie, Jing Li, Kai Ding, and Lianwen Jin

arXiv:2305.08719·cs.CV·May 23, 2023·5 cites

TL;DR

This paper introduces $M^{6}$Doc, a comprehensive large-scale dataset for document layout analysis covering diverse formats, types, layouts, languages, and annotations, along with a transformer-based analysis method called TransDLANet that achieves state-of-the-art results.

Contribution

The paper presents a novel, extensive dataset for modern document layout analysis and a new transformer-based model that improves analysis accuracy across diverse document types.

Findings

01

$M^{6}$Doc contains 237,116 annotations across 9,080 pages.

02

TransDLANet achieves 64.5% mAP on $M^{6}$Doc.

03

The dataset enhances model generalization to real-world documents.
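
As a rough illustration of the scale in finding 01, the sketch below derives the average annotation density from the two quoted totals. The function name `annotations_per_page` is hypothetical and chosen only for this example; the result is a plain mean and says nothing about how annotations are distributed across document types.

```python
# Totals quoted in finding 01 of this page.
NUM_ANNOTATIONS = 237_116
NUM_PAGES = 9_080


def annotations_per_page(num_annotations: int, num_pages: int) -> float:
    """Return the mean number of annotation instances per page."""
    if num_pages <= 0:
        raise ValueError("num_pages must be positive")
    return num_annotations / num_pages


density = annotations_per_page(NUM_ANNOTATIONS, NUM_PAGES)
print(f"{density:.1f} annotations per page on average")  # prints "26.1 annotations per page on average"
```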

Abstract

Document layout analysis is a crucial prerequisite for document understanding, including document retrieval and conversion. Most public datasets currently contain only PDF documents and lack realistic documents. Models trained on these datasets may not generalize well to real-world scenarios. Therefore, this paper introduces a large and diverse document layout analysis dataset called $M^{6}$Doc. The $M^{6}$ designation represents six properties: (1) Multi-Format (including scanned, photographed, and PDF documents); (2) Multi-Type (such as scientific articles, textbooks, books, test papers, magazines, newspapers, and notes); (3) Multi-Layout (rectangular, Manhattan, non-Manhattan, and multi-column Manhattan); (4) Multi-Language (Chinese and English); (5) Multi-Annotation Category (74 types of annotation labels with 237,116 annotation instances in 9,080 manually annotated pages); and (6)…

Taxonomy

Topics: Handwritten Text Recognition Techniques · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques
