Processing topical queries on images of historical newspaper pages
José E. B. Maia, Gildácio J. de A. Sá

TL;DR
This paper introduces a processing model for topic navigation in historical newspaper page images, addressing challenges such as low print and photograph quality and the lack of page standardization, with promising initial results on a collection spanning 28 years.
Contribution
It proposes a four-module system for segmenting, extracting, analyzing, and retrieving topics from degraded historical newspaper images, a novel approach in this domain.
Findings
Effective text segmentation and extraction on historical newspapers.
Initial results show promising topic retrieval accuracy.
System handles diverse and low-quality images.
Abstract
Historical newspapers are a source of research for the human and social sciences. However, these image collections are difficult for machines to read because of the low print quality, the lack of standardization across pages, and the low-quality photographs of some files. This paper presents the processing model of a topic navigation system for historical newspaper page images. The general procedure consists of four modules: segmentation of text sub-images and text extraction; preprocessing and representation; induced topic extraction and representation; and a document viewing and retrieval interface. The algorithmic and technological approaches of each module are described, and initial test results on a collection covering 28 years are presented.
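The data flow through the four modules can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the algorithms used, so OCR is stubbed out (each page already carries a hypothetical `ocr_text` field), topic induction is replaced by a simple stand-in that lists the highest-weighted TF-IDF terms, and retrieval uses cosine similarity. All function names, the stopword list, and the toy pages are assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; a real system would use one suited to the
# language of the newspapers.
STOPWORDS = {"the", "and", "for", "with"}


def extract_text(page):
    """Module 1 stand-in: a real system would segment text sub-images and
    run OCR. Here the page dict already holds (hypothetical) OCR output."""
    return page["ocr_text"]


def preprocess(text):
    """Module 2: lowercase, tokenize, drop stopwords and very short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) > 2 and t not in STOPWORDS]


def build_index(pages):
    """Module 2 (representation): one smoothed TF-IDF vector per page."""
    ids = [p["id"] for p in pages]
    docs = [preprocess(extract_text(p)) for p in pages]
    n = len(docs)
    df = Counter(t for tokens in docs for t in set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return ids, vectors


def top_terms(vector, k=3):
    """Module 3 stand-in: describe a page by its k heaviest terms.
    This is NOT topic induction, only a placeholder for it."""
    ranked = sorted(vector.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, ids, vectors):
    """Module 4: rank pages by similarity to a topical query."""
    q = {t: float(c) for t, c in Counter(preprocess(query)).items()}
    scored = [(page_id, cosine(q, v)) for page_id, v in zip(ids, vectors)]
    return sorted(scored, key=lambda s: -s[1])


# Toy pages with made-up text, for illustration only.
pages = [
    {"id": "page-001", "ocr_text": "Coffee harvest prices rise in the port; coffee exports grow"},
    {"id": "page-002", "ocr_text": "Election results announced; the governor wins the election"},
    {"id": "page-003", "ocr_text": "Drought reduces harvest in the interior"},
]
ids, vectors = build_index(pages)
for page_id, score in retrieve("coffee exports", ids, vectors):
    print(page_id, round(score, 3))
```

In a fuller system, `extract_text` would wrap the segmentation and OCR step, and `top_terms` would give way to an actual topic model fitted over the whole collection; the index and retrieval functions would then operate on topic representations rather than raw term weights.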
Taxonomy
Topics: Handwritten Text Recognition Techniques · Image Processing and 3D Reconstruction · Image Retrieval and Classification Techniques
Methods: Test
