Toward the Optimized Crowdsourcing Strategy for OCR Post-Correction

Omri Suissa; Avshalom Elmalech; Maayan Zhitomirsky-Geffet

arXiv:2106.06831 · cs.HC · August 1, 2023

TL;DR

This paper explores how to optimize crowdsourcing strategies for correcting OCR errors in historical documents, analyzing different task structures and text lengths to improve accuracy and efficiency.

Contribution

It systematically investigates various crowdsourcing methodologies for OCR post-correction and proposes an optimal strategy based on experimental analysis.

Findings

01. Medium (paragraph-sized) text yields the best accuracy.

02. Two-phase tasks with scanned images are most accurate.

03. Longer texts in single-stage tasks without images are most efficient.

Abstract

Digitization of historical documents is a challenging task in many digital humanities projects. A popular approach to digitization is to scan the documents into images and then convert the images into text using Optical Character Recognition (OCR) algorithms. However, the output of OCR processing of historical documents is usually inaccurate and requires post-processing error correction. This study investigates how crowdsourcing can be used to correct OCR errors in historical text collections, and which crowdsourcing methodology is the most effective in different scenarios and for various research objectives. A series of experiments with different micro-task structures and text lengths was conducted with 753 workers on Amazon's Mechanical Turk platform. The workers had to fix OCR errors in a selected historical text. To analyze the results, new accuracy and efficiency measures…
