Post-OCR Document Correction with Large Ensembles of Character Sequence-to-Sequence Models

Juan Ramirez-Orta; Eduardo Xamena; Ana Maguitman; Evangelos Milios; Axel J. Soto

arXiv:2109.06264 · cs.CL · January 26, 2022 · 1 citation

TL;DR

This paper introduces a novel ensemble-based character sequence-to-sequence approach for post-OCR document correction, effectively handling long strings and achieving state-of-the-art results across multiple languages.

Contribution

It presents strategies for processing long OCR texts efficiently using ensemble models and voting schemes, advancing post-OCR correction methods.

Findings

01

Achieved state-of-the-art results in five of the nine languages tested from the ICDAR 2019 competition on post-OCR text correction.

02

Developed a voting scheme for ensemble correction of long strings.

03

Demonstrated resource-efficient processing of long documents.

Abstract

In this paper, we propose a novel method based on character sequence-to-sequence models to correct documents already processed with Optical Character Recognition (OCR) systems. The main contribution of this paper is a set of strategies to accurately process strings much longer than the ones used to train the sequence model while being sample- and resource-efficient, supported by thorough experimentation. The strategy with the best performance involves splitting the input document in character n-grams and combining their individual corrections into the final output using a voting scheme that is equivalent to an ensemble of a large number of sequence models. We further investigate how to weigh the contributions from each one of the members of this ensemble. We test our method on nine languages of the ICDAR 2019 competition on post-OCR text correction and achieve a new state-of-the-art…
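The splitting-and-voting strategy described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `sliding_windows`, `correct_by_voting`, `toy_corrector`, and the `weight` parameter are hypothetical names introduced here, and the trained character sequence-to-sequence model is replaced by a trivial substitution function. The sketch also assumes each window's correction keeps the window's length, so votes can be aligned by position; a real corrector can insert or delete characters and would need an alignment step.

```python
from collections import Counter, defaultdict
from typing import Callable, Optional


def sliding_windows(text: str, n: int) -> list[tuple[int, str]]:
    """Return (start_offset, window) pairs of overlapping character n-grams."""
    if len(text) <= n:
        return [(0, text)]
    return [(i, text[i:i + n]) for i in range(len(text) - n + 1)]


def correct_by_voting(
    text: str,
    correct_window: Callable[[str], str],
    n: int = 5,
    weight: Optional[Callable[[int, int], float]] = None,
) -> str:
    """Correct each window independently, then vote per character position.

    `weight(pos, length)` is a hypothetical hook for weighing each window's
    contribution; the default gives every window an equal vote.
    """
    if weight is None:
        weight = lambda pos, length: 1.0

    # Absolute position in the text -> candidate character -> accumulated votes.
    votes: dict[int, Counter] = defaultdict(Counter)
    for start, window in sliding_windows(text, n):
        fixed = correct_window(window)
        if len(fixed) != len(window):
            # Assumption of this sketch: length-changing corrections are
            # ignored, because they would require alignment before voting.
            fixed = window
        for pos, char in enumerate(fixed):
            votes[start + pos][char] += weight(pos, len(fixed))

    # Each position takes the character with the largest vote mass.
    return "".join(votes[i].most_common(1)[0][0] for i in range(len(text)))


def toy_corrector(window: str) -> str:
    """Stand-in for a trained model: fixes two common OCR confusions."""
    return window.replace("0", "o").replace("1", "l")


print(correct_by_voting("hell0 w0rld", toy_corrector, n=5))  # hello world
```

Because every character is covered by up to `n` overlapping windows, each position receives several independent corrections, which is what makes the scheme behave like an ensemble. Passing a non-uniform `weight`, for example one that favors characters near the middle of a window, is one way to experiment with how much each window contributes.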

Code & Models

Repositories

jarobyte91/post_ocr_correction · PyTorch · Official

Videos

Post-OCR Document Correction with Large Ensembles of Character Sequence-to-Sequence Models · Underline

Taxonomy

Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Video Analysis and Summarization