Lessons learned developing and using a machine learning model to automatically transcribe 2.3 million handwritten occupation codes

Bjørn-Richard Pedersen; Einar Holsbø; Trygve Andersen; Nikita Shvetsov; Johan Ravn; Hilde Leikny Sommerseth; Lars Ailo Bongo

arXiv:2106.03996·cs.LG·January 7, 2022

TL;DR

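This paper shares practical lessons from developing and using Occode, an end-to-end machine learning pipeline that transcribed 2.3 million handwritten occupation codes from the Norwegian 1950 population census. The automatically transcribed codes reach 97% accuracy, and 3% of the codes are sent for manual verification. The paper covers the challenges, solutions, and verification methods involved in large-scale transcription of historical data.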

Contribution

It presents a comprehensive end-to-end machine learning pipeline for large-scale handwritten data transcription, including lessons learned and verification strategies, applicable to similar historical data projects.

Findings

01 Achieved 97% accuracy for the automatically transcribed occupation codes

02 Scaled the pipeline to 2.3 million handwritten codes

03 Verified that the occupation code distribution in the results matches the distribution in the training data
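
The distribution check in the third finding can be illustrated with a small sketch. It compares the relative frequency of each code in two lists using total variation distance. The metric, the function names, and the code strings are assumptions made for illustration; the page does not say how the authors compared the distributions.

```python
from collections import Counter


def code_distribution(codes):
    """Return the relative frequency of each occupation code."""
    counts = Counter(codes)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two discrete distributions.

    0.0 means the distributions are identical, 1.0 means they share no codes.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Made-up code strings, for illustration only (not census data).
training_codes = ["111", "111", "222", "333"]
predicted_codes = ["111", "222", "222", "333"]

tvd = total_variation_distance(
    code_distribution(training_codes),
    code_distribution(predicted_codes),
)
print(f"total variation distance: {tvd:.2f}")
```

A value near zero would suggest that the transcribed codes follow the same distribution as the training labels. How small is small enough is a judgment call that depends on the dataset.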

Abstract

Machine learning approaches achieve high accuracy for text recognition and are therefore increasingly used for the transcription of handwritten historical sources. However, using machine learning in production requires a streamlined end-to-end pipeline that scales to the dataset size and a model that achieves high accuracy with few manual transcriptions. The correctness of the model results must also be verified. This paper describes our lessons learned developing, tuning and using the Occode end-to-end machine learning pipeline for transcribing 2.3 million handwritten occupation codes from the Norwegian 1950 population census. We achieve an accuracy of 97% for the automatically transcribed codes, and we send 3% of the codes for manual verification. We verify that the occupation code distribution found in our results matches the distribution found in our training data, which should be…
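
One common way to decide which predictions go to manual verification is to route them by model confidence. The sketch below assumes that approach; the abstract as shown here does not state how the 3% were selected, and the threshold, function name, and sample predictions are all hypothetical.

```python
def route_predictions(predictions, threshold):
    """Split (code, confidence) pairs into auto-accepted and manual-review lists.

    A prediction is accepted automatically when its confidence is at least
    `threshold`; everything else is queued for a human transcriber.
    """
    accepted, manual = [], []
    for code, confidence in predictions:
        target = accepted if confidence >= threshold else manual
        target.append((code, confidence))
    return accepted, manual


# Hypothetical model outputs: (predicted code, confidence in [0, 1]).
predictions = [("111", 0.99), ("222", 0.62), ("333", 0.95), ("111", 0.88)]

accepted, manual = route_predictions(predictions, threshold=0.90)
manual_share = len(manual) / len(predictions)
print(f"{len(accepted)} accepted, {len(manual)} sent for manual verification")
```

Raising the threshold sends more codes to manual verification and would typically raise the accuracy of the automatically accepted set, so the threshold trades manual effort against accuracy.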

Code & Models

Repositories

uit-hdl/rhd-codes (tf, official)