Rerunning OCR: A Machine Learning Approach to Quality Assessment and Enhancement Prediction

Pit Schneider; Yves Maurer

arXiv:2110.01661 · cs.CL · June 22, 2023

TL;DR

This paper presents a machine learning approach for assessing OCR quality and predicting enhancement potential, helping to decide which parts of a large, diverse historical collection to reprocess while keeping overhead low.

Contribution

It introduces a text block level quality assessment method and a regression model to predict OCR improvement potential, tailored for cultural institutions handling historical data.

Findings

1. Effective text block quality assessment technique
2. Regression model predicts OCR enhancement potential
3. Supports decision-making in OCR reprocessing

Abstract

Iterating with new and improved OCR solutions forces decisions about which candidates to target for reprocessing. This especially applies when the underlying data collection is of considerable size and rather diverse in terms of fonts, languages, periods of publication and, consequently, OCR quality. This article captures the efforts of the National Library of Luxembourg to support those targeting decisions. These decisions are crucial to guarantee low computational overhead and a reduced risk of quality degradation, combined with a more quantifiable OCR improvement. In particular, this work explains the library's methodology for text block level quality assessment. By extending this technique, a regression model that can take into account the enhancement potential of a new OCR engine is also presented. They both mark promising…

Code & Models

Repositories

natliblux/nautilusocr (tf, official)

Taxonomy

Topics: Handwritten Text Recognition Techniques · Digital and Traditional Archives Management · Library Science and Information Systems