Importance of Textlines in Historical Document Classification

Martin Kišš; Jan Kohút; Karel Beneš; Michal Hradiš

arXiv:2201.09575·cs.CV·March 31, 2022·1 cites

TL;DR

This paper presents a hybrid neural network system that combines patch-level and line-level approaches for classifying historical documents by script and font, localizing their origin, and dating them; the system achieved top results in the ICDAR 2021 competition.

Contribution

It introduces a novel combination of patch-level and line-level neural network methods, together with loss functions for weakly supervised classification with multiple possible labels and for interval regression in the dating task.
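The two kinds of loss functions mentioned above can be illustrated with a minimal sketch. This page does not give the paper's actual formulas, so the forms below are assumptions: the weakly supervised loss is taken to be the negative log of the total probability assigned to any of the allowed labels, and the interval regression loss is taken to be zero inside the annotated date interval and the absolute distance to the nearest bound outside it. The function names `multi_label_nll` and `interval_l1` are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multi_label_nll(logits, allowed):
    """Assumed weak-supervision loss: negative log of the probability
    mass placed on the set of allowed label indices."""
    probs = softmax(logits)
    return -math.log(sum(probs[i] for i in allowed))


def interval_l1(pred_year, lo, hi):
    """Assumed interval regression loss: zero when the prediction lies
    in [lo, hi], otherwise the distance to the nearest bound."""
    return max(lo - pred_year, 0.0, pred_year - hi)


# A sample with two acceptable labels out of three classes.
print(multi_label_nll([2.0, 1.0, -1.0], allowed=[0, 1]))
# A prediction 50 years after the end of the annotated interval.
print(interval_l1(1600.0, 1450.0, 1550.0))
```

Under these assumptions, a model is not penalized for choosing among the acceptable labels or for any date inside the annotated interval, which is the behavior that weak labels and interval annotations call for.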

Findings

01 Achieved 98.48% accuracy in font classification

02 Achieved 88.84% accuracy in script classification

03 Mean absolute error of 21.91 years in dating

Abstract

This paper describes a system prepared at Brno University of Technology for the ICDAR 2021 Competition on Historical Document Classification, the experiments leading to its design, and the main findings. The tasks addressed include script and font classification, document origin localization, and dating. We combined patch-level and line-level approaches, where the line-level system utilizes an existing, publicly available page layout analysis engine. In both systems, neural networks provide local predictions which are combined into page-level decisions, and the results of both systems are fused using linear or log-linear interpolation. We propose loss functions suitable for weakly supervised classification problems where multiple possible labels are provided, and loss functions suitable for interval regression in the dating task. The line-level system significantly improves results in…
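The fusion step described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: local predictions are combined into a page-level distribution by simple averaging (the abstract only says they are "combined"), and the interpolation weight `w` and the helper names are hypothetical. Linear interpolation mixes the two probability vectors directly, while log-linear interpolation mixes their logarithms and renormalizes.

```python
import math


def average_local(preds):
    """Assumed aggregation: average local (patch or line) probability
    vectors into a single page-level vector."""
    n = len(preds)
    return [sum(p[k] for p in preds) / n for k in range(len(preds[0]))]


def linear_fusion(p_patch, p_line, w):
    """Linear interpolation: w * p_patch + (1 - w) * p_line."""
    return [w * a + (1.0 - w) * b for a, b in zip(p_patch, p_line)]


def log_linear_fusion(p_patch, p_line, w, eps=1e-12):
    """Log-linear interpolation: weighted sum of log-probabilities,
    renormalized so the result sums to one."""
    scores = [w * math.log(a + eps) + (1.0 - w) * math.log(b + eps)
              for a, b in zip(p_patch, p_line)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


# Page-level outputs of the two systems for a two-class problem.
patch_page = average_local([[0.9, 0.1], [0.7, 0.3]])
line_page = average_local([[0.5, 0.5], [0.3, 0.7]])
print(linear_fusion(patch_page, line_page, w=0.5))
print(log_linear_fusion(patch_page, line_page, w=0.5))
```

Log-linear interpolation behaves like a weighted geometric mean, so a class that either system considers very unlikely is suppressed more strongly than under linear interpolation.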

Taxonomy

Topics: Handwritten Text Recognition Techniques · Digital Media Forensic Detection · Text and Document Classification Technologies