A Survey of Historical Document Image Datasets
Konstantina Nikolaidou, Mathias Seuret, Hamam Mokayed, Marcus Liwicki

TL;DR
This paper systematically reviews 65 datasets for historical document image analysis, categorizing tasks, summarizing dataset features, and discussing challenges and standards to improve research comparability.
Contribution
It provides a comprehensive meta-study of existing datasets, including statistics, task types, and benchmarks, highlighting gaps and proposing standardization practices.
Findings
65 datasets analyzed with detailed summaries
Identification of gaps in dataset formats and evaluation metrics
Recommendations for standardization and conversion tools
Abstract
This paper presents a systematic literature review of image datasets for document image analysis, focusing on historical documents, such as handwritten manuscripts and early prints. Finding appropriate datasets for historical document analysis is a crucial prerequisite to facilitate research using different machine learning algorithms. However, because of the very large variety of the actual data (e.g., scripts, tasks, dates, support systems, and amount of deterioration), the different formats for data and label representation, and the different evaluation processes and benchmarks, finding appropriate datasets is a difficult task. This work fills this gap, presenting a meta-study on existing datasets. After a systematic selection process (according to PRISMA guidelines), we select 65 studies that are chosen based on different factors, such as the year of publication, number of methods…
Taxonomy
Topics: Handwritten Text Recognition Techniques · Image Retrieval and Classification Techniques · Image Processing and 3D Reconstruction
