OCR Post-Processing Error Correction Algorithm using Google Online Spelling Suggestion

Youssef Bassil; Mohammad Alwani

arXiv:1204.0191·cs.CL·April 3, 2012·75 cites

TL;DR

This paper presents a post-processing error correction algorithm for OCR outputs that leverages Google's online spelling suggestions to significantly improve correction accuracy of misspelled words.

Contribution

It introduces a context-based correction method using Google's database, enhancing OCR error correction beyond traditional approaches.

Findings

01 Significant improvement in OCR error correction rate

02 Effective detection and correction of non-word and real-word errors

03 Potential for parallelization and faster processing

Abstract

With the advent of digital optical scanners, many paper-based books, textbooks, magazines, articles, and documents are being converted into electronic versions that can be manipulated by a computer. For this purpose, OCR, short for Optical Character Recognition, was developed to translate scanned graphical text into editable computer text. Unfortunately, OCR is still imperfect: it occasionally misrecognizes letters and falsely identifies scanned text, leading to misspellings and linguistic errors in the OCR output text. This paper proposes a post-processing, context-based error correction algorithm for detecting and correcting OCR non-word and real-word errors. The proposed algorithm is based on Google's online spelling suggestion, which harnesses an internal database containing a huge collection of terms and word sequences gathered from all over the web, convenient to suggest…
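To make the idea of context-based correction concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract above does not specify how text is sent to the suggestion service, so the block size, the function name `correct_ocr_text`, and the pluggable `suggest` callback are all assumptions made for illustration. The lookup function `toy_suggest` is a hypothetical offline stand-in for an online spelling suggestion service and does not call any real Google API.

```python
from typing import Callable, List, Optional

# Hypothetical signature: takes a block of words, returns a corrected
# block, or None when the service has nothing to suggest.
Suggester = Callable[[str], Optional[str]]


def correct_ocr_text(text: str, suggest: Suggester, block_size: int = 5) -> str:
    """Correct OCR output block by block (block_size is an assumed parameter).

    Sending several words at once gives the suggester context, which is
    what allows real-word errors (valid words used wrongly) to be caught
    in addition to non-word errors.
    """
    words = text.split()
    corrected: List[str] = []
    for start in range(0, len(words), block_size):
        block = " ".join(words[start:start + block_size])
        suggestion = suggest(block)
        # Keep the original block when no suggestion is returned.
        corrected.append(suggestion if suggestion else block)
    return " ".join(corrected)


def toy_suggest(block: str) -> Optional[str]:
    """Offline stand-in for an online suggestion service (illustration only)."""
    table = {
        "the quick brovvn fox": "the quick brown fox",  # non-word error
        "a peace of cake": "a piece of cake",           # real-word error
    }
    return table.get(block.lower())


if __name__ == "__main__":
    print(correct_ocr_text("the quick brovvn fox jumps", toy_suggest, block_size=4))
```

Because each block is corrected independently in this sketch, the calls to `suggest` could in principle be issued concurrently, which is one way to read the parallelization potential listed under Findings.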

Taxonomy

Topics: Handwritten Text Recognition Techniques · Image Retrieval and Classification Techniques · Vehicle License Plate Recognition