CENSUS-HWR: a large training dataset for offline handwriting recognition
Chetan Joshi, Lawry Sorenson, Ammon Wolfert, Mark Clement, Joseph Price, and Kasey Buckles

TL;DR
CENSUS-HWR introduces a large-scale dataset of over 1.8 million handwritten English words from census records, aiming to advance offline handwriting recognition by providing a comprehensive benchmark for deep learning models.
Contribution
The paper presents a new extensive dataset for offline handwriting recognition, derived from historical census data, and provides trained models to facilitate research.
Findings
The dataset contains 1.86 million handwritten texts drawn from a vocabulary of 10,711 English words.
The dataset is extracted from US 1930 and 1940 censuses.
Models trained on this dataset achieve improved recognition performance.
Abstract
Progress in Automated Handwriting Recognition has been hampered by the lack of large training datasets. Nearly all research uses a set of small datasets that often cause models to overfit. We present CENSUS-HWR, a new dataset consisting of full English handwritten words in 1,812,014 grayscale images. A total of 1,865,134 handwritten texts from a vocabulary of 10,711 words in the English language are present in this collection. This dataset is intended to serve handwriting models as a benchmark for deep learning algorithms. This huge English handwriting recognition dataset has been extracted from the US 1930 and 1940 censuses taken by approximately 70,000 enumerators each year. The dataset and the trained model, with its weights, are freely available to download at https://censustree.org/data.html.
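To put the abstract's figures in proportion, the short Python sketch below derives two simple ratios from them. The constants are the counts quoted in the abstract; the function and variable names are illustrative only and are not part of the released dataset or model. The abstract does not say how the texts are distributed across the vocabulary, nor why the text count exceeds the image count, so these are plain averages and differences, not per-word statistics.

```python
# Counts quoted in the abstract.
NUM_IMAGES = 1_812_014   # grayscale images
NUM_TEXTS = 1_865_134    # handwritten texts
VOCAB_SIZE = 10_711      # distinct English words


def mean_texts_per_word(num_texts: int, vocab_size: int) -> float:
    """Average number of handwritten texts per vocabulary word.

    This is only a mean: the real distribution across words is not
    given in the abstract and may be heavily skewed.
    """
    return num_texts / vocab_size


def texts_beyond_images(num_texts: int, num_images: int) -> int:
    """How many more texts than images the abstract reports."""
    return num_texts - num_images


if __name__ == "__main__":
    avg = mean_texts_per_word(NUM_TEXTS, VOCAB_SIZE)
    extra = texts_beyond_images(NUM_TEXTS, NUM_IMAGES)
    print(f"Mean texts per vocabulary word: {avg:.1f}")
    print(f"Texts in excess of images: {extra:,}")
```

On average, then, each vocabulary word is represented by well over a hundred handwritten samples, which is the sense in which the collection is large relative to its vocabulary.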
Taxonomy
Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Hand Gesture Recognition Systems
