From OCR to Analysis: Tracking Correction Provenance in Digital Humanities Pipelines
Haoze Guo, Ziqi Wei

TL;DR
This paper introduces a provenance-aware framework for OCR correction in digital humanities, enabling detailed tracking of text transformations to improve interpretability and reproducibility.
Contribution
It presents a provenance tracking system that records correction details at the span level (edit type, correction source, confidence, and revision status), making OCR correction workflows for humanities texts more transparent and easier to analyze.
Findings
Correction pathways can substantially alter named entity extraction results and document-level interpretations.
Provenance signals help identify unstable outputs for review.
Provenance tracking supports reproducibility and source criticism.
Abstract
Optical Character Recognition (OCR) is a critical but error-prone stage in digital humanities text pipelines. While OCR correction improves usability for downstream NLP tasks, common workflows often overwrite intermediate decisions, obscuring how textual transformations affect scholarly interpretation. We present a provenance-aware framework for OCR-corrected humanities corpora that records correction lineage at the span level, including edit type, correction source, confidence, and revision status. Using a pilot corpus of historical texts, we compare downstream named entity extraction across raw OCR, fully corrected text, and provenance-filtered corrections. Our results show that correction pathways can substantially alter extracted entities and document-level interpretations, while provenance signals help identify unstable outputs and prioritize human review. We argue that provenance…
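To make the span-level record concrete, here is a minimal Python sketch of what such a provenance entry and a provenance-filtered correction pass might look like. The abstract names only the four recorded attributes (edit type, correction source, confidence, revision status). Everything else here is an assumption for illustration, not the authors' implementation: the class name `CorrectionRecord`, the function `apply_corrections`, the field names, the status values, the confidence threshold, and the sample sentence.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CorrectionRecord:
    """Hypothetical span-level provenance record (all field names are assumptions)."""
    start: int         # start offset in the raw OCR text (inclusive)
    end: int           # end offset in the raw OCR text (exclusive)
    original: str      # raw OCR text covered by the span
    corrected: str     # proposed replacement text
    edit_type: str     # e.g. "substitution", "insertion", "deletion"
    source: str        # e.g. "model", "human", "dictionary"
    confidence: float  # assumed to lie in [0, 1]
    status: str        # e.g. "proposed", "accepted", "rejected"


def apply_corrections(raw: str,
                      records: List[CorrectionRecord],
                      min_confidence: float = 0.0) -> str:
    """Apply non-rejected corrections at or above a confidence threshold.

    The raw text is never modified in place, so every output can be
    traced back to the records that produced it. Spans are assumed
    not to overlap.
    """
    kept = [r for r in records
            if r.status != "rejected" and r.confidence >= min_confidence]
    out = raw
    # Apply right to left so offsets of earlier spans remain valid.
    for r in sorted(kept, key=lambda rec: rec.start, reverse=True):
        if raw[r.start:r.end] != r.original:
            raise ValueError(f"span {r.start}:{r.end} does not match raw text")
        out = out[:r.start] + r.corrected + out[r.end:]
    return out


# Invented sample data, for illustration only.
raw = "Tbe Duke of Wellingtou arrived."
records = [
    CorrectionRecord(0, 3, "Tbe", "The", "substitution", "model", 0.98, "accepted"),
    CorrectionRecord(12, 22, "Wellingtou", "Wellington", "substitution", "model", 0.55, "proposed"),
]

fully_corrected = apply_corrections(raw, records)
filtered = apply_corrections(raw, records, min_confidence=0.9)
```

In this toy case the fully corrected text contains "Wellington" while the filtered text keeps the raw "Wellingtou", so an entity extractor could return different results for the two variants. The low-confidence record is the kind of span a reviewer might be asked to check first.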
