The Notary in the Haystack -- Countering Class Imbalance in Document Processing with CNNs
Martin Leipert, Georg Vogeler, Mathias Seuret, Andreas Maier, Vincent Christlein

TL;DR
This paper investigates methods to address class imbalance in document processing using CNNs, focusing on identifying notarial instruments and their notary signs in medieval documents, and evaluates various techniques for classification and segmentation tasks.
Contribution
It systematically evaluates countermeasures like data augmentation, oversampling, and specialized loss functions to improve CNN performance on imbalanced document datasets.
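As a rough illustration of the oversampling countermeasure named above, the sketch below duplicates randomly chosen minority-class samples until all classes are the same size. The function name `oversample`, the toy document names, and the 2-versus-8 class split are assumptions made for this example; they are not taken from the paper, whose exact sampling procedure is not described in this summary.

```python
import random
from collections import Counter

def oversample(samples, labels, seed=0):
    """Randomly duplicate samples of smaller classes until every class
    has as many samples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        # Draw duplicates with replacement from the existing class members.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((sample, label) for sample in items + extra)
    rng.shuffle(balanced)
    return balanced

# Made-up toy data: 2 notarial instruments (label 1) among 8 other documents (label 0).
docs = [f"doc{i}" for i in range(10)]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
balanced = oversample(docs, labels)
counts = Counter(label for _, label in balanced)
print(counts)
```

In a training pipeline the duplicated samples would typically pass through data augmentation so that the repeated minority images are not pixel-identical, which is one plausible reason the two techniques are evaluated together.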
Findings
Oversampling combined with data augmentation yields best classification results.
Class-weighted dice loss effectively segments notary signs.
Countermeasures improve CNN accuracy on imbalanced document data.
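The class-weighted dice loss mentioned in the findings can be sketched as follows. This summary does not give the exact formulation used in the paper, so the code shows one common variant as an assumption: a per-class soft dice score, averaged with per-class weights and subtracted from 1. The function name, the smoothing term `eps`, and the toy masks and weights are all illustrative.

```python
def weighted_dice_loss(probs, targets, weights, eps=1e-6):
    """One common class-weighted soft dice loss (assumed variant).

    probs[c][i]   : predicted probability of class c at pixel i
    targets[c][i] : one-hot ground truth (0 or 1) for class c at pixel i
    weights[c]    : non-negative weight of class c
    """
    weighted_sum = 0.0
    for p_c, t_c, w_c in zip(probs, targets, weights):
        intersection = sum(p * t for p, t in zip(p_c, t_c))
        denominator = sum(p_c) + sum(t_c)
        dice_c = (2.0 * intersection + eps) / (denominator + eps)
        weighted_sum += w_c * dice_c
    return 1.0 - weighted_sum / sum(weights)

# Made-up 4-pixel example: class 0 = background, class 1 = notary sign (rare).
targets = [[1, 1, 1, 0], [0, 0, 0, 1]]
all_background = [[1, 1, 1, 1], [0, 0, 0, 0]]  # prediction that misses the sign

loss_perfect = weighted_dice_loss(targets, targets, [0.25, 0.75])
loss_equal = weighted_dice_loss(all_background, targets, [0.5, 0.5])
loss_weighted = weighted_dice_loss(all_background, targets, [0.25, 0.75])
print(loss_perfect, loss_equal, loss_weighted)
```

With the rare class weighted more heavily, a prediction that labels everything as background is penalized more than under equal weights, which is the intended effect of class weighting on an imbalanced segmentation task.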
Abstract
Notarial instruments are a category of documents. A notarial instrument can be distinguished from other documents by its notary sign, a prominent symbol in the certificate, which also makes it possible to identify the document's issuer. Naturally, notarial instruments are underrepresented relative to other documents. This makes classification difficult because class imbalance in training data worsens the performance of Convolutional Neural Networks. In this work, we evaluate different countermeasures for this problem. They are applied to a binary classification and a segmentation task on a collection of medieval documents. In classification, notarial instruments are distinguished from other documents, while the notary sign is separated from the certificate in the segmentation task. We evaluate different techniques, such as data augmentation, under- and oversampling, as well as regularizing with…
Taxonomy
Methods: Dice Loss
