Graph-based Document Structure Analysis
Yufan Chen, Ruiping Liu, Junwei Zheng, Di Wen, Kunyu Peng, Jiaming Zhang, Rainer Stiefelhagen

TL;DR
This paper introduces a graph-based approach to document structure analysis, enabling models to detect elements and infer complex spatial and logical relations, advancing beyond traditional superficial methods.
Contribution
The paper presents a new gDSA task, a large dataset, and a relation graph generator, facilitating holistic document understanding through spatial and logical relation inference.
Findings
Achieved 57.6% mAP_g@0.5 on the new dataset
Constructed a dataset with 80K images and 4.13M annotations
Demonstrated improved document comprehension capabilities
Abstract
When reading a document, glancing at its spatial layout is an initial step toward understanding it roughly. Traditional document layout analysis (DLA) methods, however, offer only a superficial parsing of documents, focusing on basic instance detection and often failing to capture the nuanced spatial and logical relations between instances. These limitations hinder DLA-based models from achieving a gradually deeper comprehension akin to human reading. In this work, we propose a novel graph-based Document Structure Analysis (gDSA) task. This task requires that the model not only detect document elements but also generate spatial and logical relations in the form of a graph structure, allowing documents to be understood in a holistic and intuitive manner. For this new task, we construct a relation graph-based document structure analysis dataset (GraphDoc) with 80K document images and 4.13M…
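The graph output that the abstract describes can be pictured as document elements (nodes) joined by typed relations (edges). The following is a minimal sketch, not the authors' format or the DRGG model: the class names `Element` and `RelationGraph` are hypothetical, the category names are the 11 listed in the reviews, and the relation labels are assumptions (the reviews mention four spatial relation types and parent, child, and reference relations, but this page does not give the full label set). The edge direction convention is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple

# The 11 instance categories named in the review excerpt on this page.
CATEGORIES = {
    "Caption", "Footnote", "Formula", "Listitem", "Page-footer", "Page-header",
    "Picture", "Section-header", "Table", "Text", "Title",
}
# ASSUMPTION: illustrative relation labels, not GraphDoc's confirmed label set.
SPATIAL = {"up", "down", "left", "right"}
LOGICAL = {"parent", "child", "reference"}


@dataclass(frozen=True)
class Element:
    """A detected document element (hypothetical structure)."""
    idx: int
    category: str
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1), top-left origin


@dataclass
class RelationGraph:
    """Elements as nodes, (src, dst, relation) triples as directed edges."""
    elements: Dict[int, Element] = field(default_factory=dict)
    edges: List[Tuple[int, int, str]] = field(default_factory=list)

    def add_element(self, element: Element) -> None:
        if element.category not in CATEGORIES:
            raise ValueError(f"unknown category: {element.category}")
        self.elements[element.idx] = element

    def add_relation(self, src: int, dst: int, relation: str) -> None:
        if relation not in SPATIAL | LOGICAL:
            raise ValueError(f"unknown relation: {relation}")
        if src not in self.elements or dst not in self.elements:
            raise KeyError("both endpoints must be added before relating them")
        self.edges.append((src, dst, relation))

    def relations_of(self, src: int, kinds: Optional[Set[str]] = None):
        """Return (dst, relation) pairs leaving src, optionally filtered by label."""
        return [
            (dst, rel)
            for s, dst, rel in self.edges
            if s == src and (kinds is None or rel in kinds)
        ]


# ASSUMED convention: (src, dst, "down") means dst lies below src on the page.
graph = RelationGraph()
graph.add_element(Element(0, "Section-header", (50, 40, 550, 70)))
graph.add_element(Element(1, "Text", (50, 80, 550, 300)))
graph.add_element(Element(2, "Picture", (50, 320, 550, 600)))
graph.add_element(Element(3, "Caption", (50, 610, 550, 640)))
graph.add_relation(0, 1, "down")       # spatial: text sits below the header
graph.add_relation(0, 1, "parent")     # logical: header is the parent of the text
graph.add_relation(1, 0, "child")      # logical: text is a child of the header
graph.add_relation(3, 2, "reference")  # logical: caption refers to the picture

print(graph.relations_of(0))           # all relations leaving the header
print(graph.relations_of(0, LOGICAL))  # logical relations only
```

Keeping spatial and logical labels on the same edge list lets one pair of elements carry both kinds of relation at once, which matches the reviewers' remark that most instances exhibit multiple relationships.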
Peer Reviews
Decision·ICLR 2025 Poster
1. The research problem is interesting and novel, and it is less explored in this area. 2. The dataset generation and the evaluation metrics look reasonable. 3. Some analyses are conducted to give insight into the dataset, such as the correlations and co-occurrence of different relation pairs.
1. Some key concepts are not clearly defined, e.g. why spatial relations are important, why only four spatial relation types are defined, and whether it is appropriate for documents from different domains to adopt the same rules for defining relations between semantic entities. 2. There is no analysis of the rule-based relation generation and the human refinement details, which is essential for demonstrating the high quality of the dataset. 3. It would be clearer if the detailed categorisation of parent, child, and reference re
This is a large new dataset (80,000 single-page document images; 1.10 million instances across 11 categories: Caption, Footnote, Formula, Listitem, Page-footer, Page-header, Picture, Section-header, Table, Text, and Title) that extends DocLayNet (Pfitzmann et al., 2022), and the relative location of the document elements is key to document understanding.
This paper doesn't really have any major weaknesses. It is mostly a dataset paper and the dataset is strong. The authors also introduce a new model, which is rather straightforward but still interesting.
The authors compare the proposed DRGG framework with several existing document layout analysis and graph structure analysis methods, demonstrating through experiments that the model performs strongly on the gDSA task, particularly in handling complex document structures, which is significant for achieving deeper document understanding.
1. Most document instances exhibit multiple relationships. The paper should provide additional experiments to evaluate the precision and recall of DRGG in capturing instances that have both spatial and logical relationships, compared to those with only one type of relationship. 2. In the dataset, the number of samples in the position relationship category is significantly higher than that in the logical relationship category. It would be better to consider adding experiments to demonstrate how t
Taxonomy
Topics: Semantic Web and Ontologies · Web Data Mining and Analysis · Advanced Graph Neural Networks
