Data Generation for Post-OCR correction of Cyrillic handwriting
Evgenii Davydkin, Aleksandr Markelov, Egor Iuldashev, Anton Dudkin, Ivan Krivorotov

TL;DR
This paper presents a new synthetic data generation method using Bezier curves for training post-OCR correction models on handwritten Cyrillic text, improving accuracy and enabling error highlighting for educational purposes.
Contribution
It introduces a Bezier curve-based handwriting synthesis engine and applies a T5-based seq2seq model for post-OCR correction of Cyrillic handwriting, filling a data gap in the field.
Findings
Improved Word Accuracy Rate (WAR) and Character Accuracy Rate (CAR) after correction.
Synthetic dataset effectively trains POC models for Cyrillic handwriting.
Demonstrated potential for error highlighting in educational contexts.
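To make the two metrics concrete, here is a minimal sketch of how WAR and CAR are commonly computed. It assumes the usual convention of one minus the normalized Levenshtein distance, taken over words and characters respectively. The summary does not state the paper's exact definitions, so the function names and the clamping to zero are illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy_rate(ref, hyp):
    """1 - (edit distance / reference length), clamped at 0 (assumed convention)."""
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


def car(reference, hypothesis):
    """Character Accuracy Rate over the raw character sequences."""
    return accuracy_rate(list(reference), list(hypothesis))


def war(reference, hypothesis):
    """Word Accuracy Rate over whitespace-separated tokens."""
    return accuracy_rate(reference.split(), hypothesis.split())
```

Under these assumptions, comparing `car` and `war` for the raw HTR output and for the corrected output against the same ground truth gives the before/after improvement the findings refer to.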
Abstract
This paper introduces a novel approach to post-Optical Character Recognition Correction (POC) for handwritten Cyrillic text, addressing a significant gap in current research methodologies. This gap is due to the lack of large text corpora that provide OCR errors for further training of language-based POC models, which are demanding in terms of corpus size. Our study primarily focuses on the development and application of a synthetic handwriting generation engine based on Bézier curves. Such an engine generates highly realistic handwritten text in any quantity, which we utilize to create a substantial dataset by transforming Russian text corpora sourced from the internet. We apply a Handwritten Text Recognition (HTR) model to this dataset to identify OCR errors, forming the basis for our POC model training. The correction model is trained on a 90-symbol input context, utilizing a…
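As a rough illustration of the idea behind a Bézier-based handwriting engine, the sketch below samples points along a single cubic Bézier stroke and randomly perturbs its control points to imitate writer variation. This is not the authors' engine: the function names, the use of cubic curves, and the uniform jitter are all assumptions made for the example.

```python
import random


def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at t in [0, 1] using the Bernstein form."""
    u = 1.0 - t
    b0, b1, b2, b3 = u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3
    x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
    y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
    return (x, y)


def sample_stroke(control_points, n=20):
    """Return n evenly spaced (in t) points along one stroke; n must be >= 2."""
    p0, p1, p2, p3 = control_points
    return [cubic_bezier(p0, p1, p2, p3, i / (n - 1)) for i in range(n)]


def jitter(control_points, scale, rng):
    """Hypothetical variation step: shift each control point by up to `scale`."""
    return [(x + rng.uniform(-scale, scale), y + rng.uniform(-scale, scale))
            for x, y in control_points]


# One arch-shaped stroke and a slightly different rendering of it.
base = [(0.0, 0.0), (1.0, 2.0), (2.0, 2.0), (3.0, 0.0)]
rng = random.Random(0)  # fixed seed so the variation is reproducible
variant = sample_stroke(jitter(base, 0.1, rng))
```

A glyph would typically be composed of several such strokes; rendering many jittered variants of the same text is one plausible way to obtain unlimited, varied handwriting images for the HTR model to misread.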
Taxonomy
Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Vehicle License Plate Recognition
Methods: Attention Is All You Need · Byte Pair Encoding · Inverse Square Root Schedule · Layer Normalization · Linear Layer · Attention Dropout · Gated Linear Unit · SentencePiece · Sigmoid Activation
