TL;DR
This paper introduces a transfer learning approach for OCRopus models that significantly improves character recognition accuracy on early printed books by leveraging existing models and flexible alphabet adaptation, especially with limited training data.
Contribution
It presents a novel method for transfer learning in OCRopus, enabling flexible alphabet expansion and reduction, and demonstrates substantial error reduction on early printed books with minimal training data.
Findings
Character error rates reduced by up to 43% using transfer learning.
Starting from mixed models trained on unrelated data still improves results.
Significant improvements achieved with as few as 60 lines of ground truth.
Abstract
A method is presented that significantly reduces the character error rates for OCR text obtained from OCRopus models trained on early printed books when only small amounts of diplomatic transcriptions are available. This is achieved by building on existing models during training instead of starting from scratch. To overcome the discrepancies between the character set of the pretrained model and that of the additional ground truth, the OCRopus code is adapted to allow for alphabet expansion or reduction. Characters can now be flexibly added to and deleted from the pretrained alphabet when an existing model is loaded. For our experiments, we use a self-trained mixed model on early Latin prints and the two standard OCRopus models on modern English and German Fraktur texts. The evaluation on seven early printed books showed that training from the Latin mixed…
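The alphabet expansion and reduction described in the abstract can be illustrated with a minimal sketch. This is not the authors' OCRopus modification, which is not shown here. It assumes the model's output layer holds one weight row per character, and the function name `adapt_output_layer`, the toy weights, and the random initialization range are all hypothetical. Rows for characters shared by both alphabets are carried over, rows for characters missing from the new ground truth are dropped, and rows for new characters are freshly initialized.

```python
import random


def adapt_output_layer(old_alphabet, new_alphabet, old_weights, seed=0):
    """Build output-layer rows for new_alphabet from a pretrained layer.

    old_weights holds one row (list of floats) per character in old_alphabet.
    Hypothetical helper for illustration; not part of OCRopus.
    """
    rng = random.Random(seed)
    width = len(old_weights[0])
    row_of = dict(zip(old_alphabet, old_weights))

    new_weights = []
    for ch in new_alphabet:
        if ch in row_of:
            # Shared character: reuse the pretrained row.
            new_weights.append(list(row_of[ch]))
        else:
            # New character: small random initialization (assumed range).
            new_weights.append([rng.uniform(-0.1, 0.1) for _ in range(width)])
    # Characters only in old_alphabet are dropped by never being copied.
    return new_weights


# Toy example: "w" is removed, long s ("ſ") is added.
old_alphabet = ["a", "b", "w"]
old_weights = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
new_alphabet = ["a", "b", "ſ"]
new_weights = adapt_output_layer(old_alphabet, new_alphabet, old_weights)
```

Training would then continue on the new ground truth with the adapted layer, so that only the rows for added characters start without prior knowledge.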
