Diachronic Document Dataset for Semantic Layout Analysis
Thibault Clérice (ALMAnaCH), Juliette Janes (ALMAnaCH), Hugo Scheithauer, Sarah Bénière (ALMAnaCH), Florian Cafiero (PSL), Laurent Romary (ALMAnaCH, DCIS), Simon Gabay, Benoît Sagot

TL;DR
This paper introduces an open-access dataset of 7,254 annotated pages of historical and modern documents (1600-2024) for semantic layout analysis, and evaluates object detection models on it.
Contribution
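The paper provides an annotated, multi-genre dataset of 7,254 pages spanning 1600-2024, sorted into modular subsets and mapped to the Text Encoding Initiative (TEI) standard, and evaluates how input size and subset-based training affect object detection models.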
Findings
A 1280-pixel input size is optimal for YOLO.
Training on subsets improves performance over fine-tuning.
The dataset covers diverse document types and historical periods (1600-2024).
Abstract
We present a novel, open-access dataset designed for semantic layout analysis, built to support document recreation workflows through mapping with the Text Encoding Initiative (TEI) standard. This dataset includes 7,254 annotated pages spanning a large temporal range (1600-2024) of digitised and born-digital materials across diverse document types (magazines, papers from sciences and humanities, PhD theses, monographs, plays, administrative reports, etc.) sorted into modular subsets. By incorporating content from different periods and genres, it addresses varying layout complexities and historical changes in document structure. The modular design allows domain-specific configurations. We evaluate object detection models on this dataset, examining the impact of input size and subset-based training. Results show that a 1280-pixel input size for YOLO is optimal and that training on subsets…
Taxonomy
Topics: Web Data Mining and Analysis · Semantic Web and Ontologies
