Estimating Post-OCR Denoising Complexity on Numerical Texts
Arthur Hemmer, Jérôme Brachat, Mickaël Coustaty, Jean-Marc Ogier

TL;DR
This paper introduces a method to estimate the difficulty of post-OCR denoising, shows that numerical texts are harder to denoise than natural language texts, and validates the estimator against the error rates of modern denoising approaches.
Contribution
The paper proposes a novel complexity estimation method tailored for numerical texts and demonstrates its effectiveness across various datasets, highlighting the increased challenge of numerical OCR post-processing.
Findings
Numerical texts have higher denoising complexity than natural language texts.
The proposed estimator correlates well with actual denoising error rates.
Numerical document types pose greater challenges for OCR post-processing.
Abstract
Post-OCR processing has significantly improved over the past few years. However, these improvements have primarily benefited texts consisting of natural, alphabetical words, as opposed to documents of a numerical nature such as invoices, payslips, and medical certificates. To evaluate the OCR post-processing difficulty of such documents, we propose a method to estimate the denoising complexity of a text, evaluate it on several datasets of varying nature, and show that texts of a numerical nature are at a significant disadvantage. We compare the estimated complexity ranking with the error rates of modern-day denoising approaches to show the validity of our estimator.
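The abstract says the estimated complexity ranking is checked against the error rates of denoising approaches, but this page does not say how. One common way to compare two rankings is a Spearman rank correlation; the sketch below assumes that choice and is not taken from the paper. The function names and all numeric values are hypothetical placeholders, not results.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation, computed as the Pearson correlation of ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical placeholder values, one entry per dataset (NOT from the paper).
estimated_complexity = [0.10, 0.25, 0.40, 0.55]
observed_error_rate = [0.02, 0.05, 0.04, 0.09]

rho = spearman(estimated_complexity, observed_error_rate)
print(f"Spearman rho = {rho:.3f}")
```

A value of rho close to 1 would mean the estimator orders the datasets almost the same way the observed error rates do; only the ordering matters here, not the scale of either quantity.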
Taxonomy
Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Music and Audio Processing
