Processing the structure of documents: Logical Layout Analysis of historical newspapers in French

Nicolas Gutehrlé; Iana Atanassova

arXiv:2202.08125·cs.CL·June 22, 2023·1 cites

Open Access

TL;DR

This paper presents a rule-based approach to Logical Layout Analysis of historical French newspapers. The system outperforms two machine-learning models (RIPPER and Gradient Boosting) in nearly all evaluations, most clearly in Recall, and could be used to annotate data at scale for future machine-learning or deep-learning applications.

Contribution

It introduces a rule-based system for layout analysis that surpasses ML models in recall and discusses hybrid approaches and rule learning for adapting to layout evolution.

Findings

01. The rule-based system outperforms the ML models, especially in Recall

02. The system covers more types of every logical label than the ML models

03. Potential for hybrid systems and adaptive rule learning

Abstract

Background. In recent years, libraries and archives have led major digitisation campaigns that opened access to vast collections of historical documents. While such documents are often available in the XML ALTO format, they lack information about their logical structure. In this paper, we address the problem of Logical Layout Analysis applied to historical documents in French. We propose a rule-based method, which we evaluate and compare against two Machine-Learning models, namely RIPPER and Gradient Boosting. Our data set contains French newspapers, periodicals and magazines published in the first half of the twentieth century in the Franche-Comté region. Results. Our rule-based system outperforms the two other models in nearly all evaluations. Its Recall results are notably better, indicating that our system covers more types of every logical label than the other two models. When…
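
As a rough illustration of what a rule-based labelling pass over XML ALTO input might look like, the sketch below reads text blocks from a toy ALTO fragment, applies a single hand-written rule, and computes per-label Recall. Everything in it is an assumption made for illustration: the sample XML, the function names (`read_blocks`, `label_blocks`, `recall`), the `title` and `paragraph` labels, and the line-height threshold are hypothetical and are not the rules or labels used in the paper.

```python
import xml.etree.ElementTree as ET
from statistics import median

# ALTO v3 namespace; other ALTO versions use a different namespace URI.
ALTO_NS = {"a": "http://www.loc.gov/standards/alto/ns-v3#"}

# Toy ALTO fragment, invented for this example (not taken from the paper's data set).
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace>
    <TextBlock ID="b1">
      <TextLine HEIGHT="60"><String CONTENT="TITRE"/><String CONTENT="EXEMPLE"/></TextLine>
    </TextBlock>
    <TextBlock ID="b2">
      <TextLine HEIGHT="20"><String CONTENT="Premier"/><String CONTENT="paragraphe."/></TextLine>
    </TextBlock>
    <TextBlock ID="b3">
      <TextLine HEIGHT="22"><String CONTENT="Second"/><String CONTENT="paragraphe."/></TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def read_blocks(xml_text):
    """Return one dict per TextBlock with its ID, text and tallest line height."""
    root = ET.fromstring(xml_text)
    blocks = []
    for tb in root.iterfind(".//a:TextBlock", ALTO_NS):
        lines = tb.findall("a:TextLine", ALTO_NS)
        heights = [float(line.get("HEIGHT")) for line in lines]
        words = [s.get("CONTENT")
                 for line in lines
                 for s in line.findall("a:String", ALTO_NS)]
        blocks.append({"id": tb.get("ID"),
                       "text": " ".join(words),
                       "height": max(heights)})
    return blocks


def label_blocks(blocks, ratio=1.5):
    """Hypothetical rule: a block whose lines are much taller than the
    page median is labelled 'title', everything else 'paragraph'."""
    base = median(b["height"] for b in blocks)
    return {b["id"]: ("title" if b["height"] >= ratio * base else "paragraph")
            for b in blocks}


def recall(gold, predicted, label):
    """Share of blocks carrying `label` in the gold standard that were also predicted as `label`."""
    relevant = [block_id for block_id, lab in gold.items() if lab == label]
    if not relevant:
        return 0.0
    hits = sum(1 for block_id in relevant if predicted.get(block_id) == label)
    return hits / len(relevant)


if __name__ == "__main__":
    predicted = label_blocks(read_blocks(SAMPLE))
    gold = {"b1": "title", "b2": "paragraph", "b3": "paragraph"}
    for lab in ("title", "paragraph"):
        print(lab, recall(gold, predicted, lab))
```

A real system would combine many such rules over typographic and positional attributes. The single threshold here only shows the general shape of the pipeline: read the ALTO blocks, apply rules, then score each label separately.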

Taxonomy

Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Rough Sets and Fuzzy Logic