Combining Morphological and Histogram based Text Line Segmentation in the OCR Context

Pit Schneider

arXiv:2103.08922·cs.CV·June 22, 2023

TL;DR

This paper presents a robust, low-cost text line segmentation method combining morphological operations and histogram projections, improving accuracy and speed for historic document OCR processing.

Contribution

It introduces a novel combination of morphological and histogram techniques tailored for degraded historic documents, enhancing OCR pipeline performance.

Findings

01 Improved segmentation accuracy on historic documents

02 Reduced computational cost compared to existing methods

03 Successful integration into a national OCR pipeline

Abstract

Text line segmentation is one of the pre-stages of modern optical character recognition systems. The algorithmic approach proposed in this paper has been designed for this exact purpose. Its main characteristic is the combination of two different techniques: morphological image operations and horizontal histogram projections. The method was developed to be applied to a historic data collection that commonly features quality issues, such as degraded paper, blurred text, or the presence of noise. For that reason, the segmenter in question could be of particular interest to cultural institutions that want access to robust line bounding boxes for a given historic document. Because of the promising segmentation results, which are accompanied by low computational cost, the algorithm was incorporated into the OCR pipeline of the National Library of Luxembourg, in the context of the initiative of…
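To make the combination of the two techniques concrete, the following is a minimal sketch, not the paper's algorithm or the code in the linked repository. It assumes a binarized page (text pixels are True), smears words together with a horizontal dilation, and then cuts the page into lines wherever the horizontal projection profile drops below a threshold. The function names and the parameters `kernel_width`, `min_fraction`, and `min_height` are hypothetical choices made for illustration only.

```python
import numpy as np


def dilate_horizontal(binary, width):
    """Binary dilation with a 1 x width horizontal structuring element.

    Hypothetical helper: a NumPy-only stand-in for a morphological
    dilation from an image library. `width` is assumed to be odd.
    """
    pad = width // 2
    padded = np.pad(binary.astype(bool), ((0, 0), (pad, pad)), mode="constant")
    out = np.zeros(binary.shape, dtype=bool)
    # OR together all horizontally shifted copies covered by the kernel.
    for shift in range(width):
        out |= padded[:, shift:shift + binary.shape[1]]
    return out


def segment_lines(binary, kernel_width=15, min_fraction=0.05, min_height=2):
    """Return line bounding boxes as (left, top, right, bottom), end-exclusive.

    All thresholds are illustrative assumptions, not values from the paper.
    """
    # Morphological step: close gaps between characters and words.
    smeared = dilate_horizontal(binary, kernel_width)

    # Histogram step: count foreground pixels per row.
    profile = smeared.sum(axis=1)
    is_text = profile > min_fraction * binary.shape[1]

    # Locate runs of consecutive text rows via their start/end transitions.
    padded = np.concatenate(([False], is_text, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = changes[0::2], changes[1::2]

    boxes = []
    for top, bottom in zip(starts, ends):
        if bottom - top < min_height:
            continue  # too thin to be a text line; treated as noise
        cols = np.flatnonzero(binary[top:bottom].any(axis=0))
        boxes.append((int(cols[0]), int(top), int(cols[-1]) + 1, int(bottom)))
    return boxes


# Synthetic page: two text lines, one word gap, and one isolated noise pixel.
page = np.zeros((40, 100), dtype=bool)
page[5:10, 10:90] = True
page[5:10, 40:45] = False
page[20:26, 20:80] = True
page[15, 50] = True
boxes = segment_lines(page)
```

In this sketch the dilation makes the projection profile less sensitive to gaps between words, while the minimum height filter discards isolated specks. Real historic scans would additionally need binarization and, typically, skew correction before such a profile is meaningful.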

Code & Models

Repositories

natliblux/nautilusocr (TensorFlow, official)