Processing the structure of documents: Logical Layout Analysis of historical newspapers in French
Nicolas Gutehrlé, Iana Atanassova

TL;DR
This paper presents a rule-based approach to logical layout analysis of historical French newspapers. It outperforms two machine learning models (RIPPER and Gradient Boosting), particularly in recall, and enables large-scale annotation for future ML or deep learning applications.
Contribution
It introduces a rule-based system for layout analysis that surpasses ML models in recall and discusses hybrid approaches and rule learning for adapting to layout evolution.
Findings
Rule-based system outperforms ML models in recall
System covers more logical label types
Potential for hybrid systems and adaptive rule learning
Abstract
Background. In recent years, libraries and archives have led major digitisation campaigns that opened access to vast collections of historical documents. While such documents are often available as XML ALTO documents, they lack information about their logical structure. In this paper, we address the problem of Logical Layout Analysis applied to historical documents in French. We propose a rule-based method, which we evaluate and compare with two Machine-Learning models, namely RIPPER and Gradient Boosting. Our data set contains French newspapers, periodicals and magazines published in the first half of the twentieth century in the Franche-Comté region. Results. Our rule-based system outperforms the two other models in nearly all evaluations. It achieves notably better Recall, indicating that our system covers more types of every logical label than the other two models. When…
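To make the Recall comparison concrete, here is a minimal sketch of how per-label recall could be computed for a layout-labelling system. The label names (`title`, `paragraph`, `author`) and the toy data are assumptions made for illustration; they are not taken from the paper, and this is not the authors' evaluation code.

```python
from collections import Counter


def per_label_recall(gold, predicted):
    """Return recall per label: correct predictions / gold items with that label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    gold_counts = Counter(gold)
    # Count the blocks whose predicted label matches the gold label.
    hits = Counter(g for g, p in zip(gold, predicted) if g == p)
    return {label: hits[label] / total for label, total in gold_counts.items()}


# Hypothetical labels for six text blocks (illustrative only).
gold = ["title", "title", "paragraph", "paragraph", "paragraph", "author"]
pred = ["title", "paragraph", "paragraph", "paragraph", "paragraph", "paragraph"]

recall = per_label_recall(gold, pred)
macro_recall = sum(recall.values()) / len(recall)
```

In this toy case the system finds every `paragraph` block but only half of the `title` blocks and none of the `author` blocks. A low recall for a label means many instances of that label are missed, which is the sense in which higher recall indicates broader coverage of each logical label.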
Taxonomy
Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Rough Sets and Fuzzy Logic
