Logical segmentation for article extraction in digitized old newspapers
Thomas Palfray (LITIS), David Hébert (LITIS), Stéphane Nicolas (LITIS), Pierrick Tranouez (LITIS), Thierry Paquet (LITIS)

TL;DR
This paper presents an automated system for extracting and structuring articles from digitized old newspapers, enabling detailed indexing, retrieval, and collaborative correction to improve access to historical newspaper archives.
Contribution
The authors develop a machine learning-based method to identify logical structures in newspaper pages and integrate this with a web interface for article-level access and correction.
Findings
Successfully applied to 250 years of the Journal de Rouen archives
Achieved pixel-level labeling and structure detection on images of variable quality
Enhanced article retrieval and indexing through logical segmentation
Abstract
Newspapers are documents made of news items and informative articles. They are not meant to be read sequentially: the reader can pick items in any order. Ignoring this structural property, most digitized newspaper archives only offer access to their content by issue or, at best, by page. We have built a digitization workflow that automatically extracts newspaper articles from images, which allows indexing and retrieval of information at the article level. Our back-end system extracts the logical structure of the page to produce the informative units: the articles. Each image is labelled at the pixel level through a machine-learning-based method; the logical structure of the page is then built up from these labels by detecting structuring entities such as horizontal and vertical separators, titles and text lines. This logical structure is stored in a METS wrapper associated to…
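As a rough illustration of the step in which detected entities are assembled into articles, the sketch below groups the entities of a single column into article units. It is not the authors' method: the rule used here (a horizontal separator closes the current article, and a title that follows body text opens a new one) is an assumption made for the example, and the names `Entity`, `Article`, `group_into_articles` and the label strings are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """One detected structuring entity in a single column (hypothetical type)."""
    kind: str       # assumed labels: "title", "text_line", "h_separator"
    top: int        # y-coordinate of the entity's top edge, in pixels
    text: str = ""  # recognized text, empty for separators


@dataclass
class Article:
    """A candidate article: its title lines followed by its body lines."""
    titles: list = field(default_factory=list)
    lines: list = field(default_factory=list)


def group_into_articles(entities):
    """Group the entities of one column into articles, top to bottom.

    Assumed rule: a horizontal separator closes the current article, and a
    title appearing after body text starts a new one. Unknown kinds are ignored.
    """
    articles = []
    current = Article()
    for entity in sorted(entities, key=lambda e: e.top):
        if entity.kind == "h_separator":
            if current.titles or current.lines:
                articles.append(current)
                current = Article()
        elif entity.kind == "title":
            if current.lines:
                articles.append(current)
                current = Article()
            current.titles.append(entity.text)
        elif entity.kind == "text_line":
            current.lines.append(entity.text)
    # Keep the last article if the column does not end with a separator.
    if current.titles or current.lines:
        articles.append(current)
    return articles
```

A real page would also need vertical separators to split the page into columns before this grouping is applied, and articles that continue across columns would need an additional merging step.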
