Is text normalization relevant for classifying medieval charters?
Florian Atzenhofer-Baumgartner, Tamás Kovács

TL;DR
This paper investigates whether text normalization improves the classification of medieval charters. It finds a minimal benefit for locating and reduced accuracy for dating, which suggests that original textual features are worth preserving.
Contribution
It provides an empirical evaluation of normalization's impact on medieval document classification, comparing traditional and transformer models with and without normalization.
Findings
Normalization minimally improves locating accuracy
Normalization reduces dating accuracy
Support vector machines and gradient boosting outperform the other models, including transformers
Abstract
This study examines the impact of historical text normalization on the classification of medieval charters, specifically focusing on document dating and locating. Using a data set of Middle High German charters from a digital archive, we evaluate various classifiers, including traditional and transformer-based models, with and without normalization. Our results indicate that the given normalization minimally improves locating tasks but reduces accuracy for dating, implying that original texts contain crucial features that normalization may obscure. We find that support vector machines and gradient boosting outperform other models, questioning the efficiency of transformers for this use case. Results suggest a selective approach to historical text normalization, emphasizing the significance of preserving some textual characteristics that are critical for classification tasks in document…
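The claim that normalization may obscure useful features can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the replacement table `NORMALIZATION_RULES`, the helpers `normalize` and `char_ngrams`, and the choice of character n-grams as features are hypothetical and are not taken from the paper. The example uses the spelling pair "vnd"/"und" as a stand-in for scribal variation and shows that the two forms yield different n-gram features before normalization and identical ones afterwards.

```python
from collections import Counter

# Hypothetical rule-based normalization table (an assumption for illustration;
# this is NOT the normalization evaluated in the paper).
NORMALIZATION_RULES = {"v": "u", "cz": "z", "y": "i"}


def normalize(text: str) -> str:
    """Lowercase the text and apply the toy rules, longest pattern first."""
    text = text.lower()
    for old in sorted(NORMALIZATION_RULES, key=len, reverse=True):
        text = text.replace(old, NORMALIZATION_RULES[old])
    return text


def char_ngrams(text: str, n: int = 2) -> Counter:
    """Count overlapping character n-grams (assumed feature type)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# Two spelling variants of the same word, standing in for scribal variation.
variant_a = "vnd"
variant_b = "und"

raw_a, raw_b = char_ngrams(variant_a), char_ngrams(variant_b)
norm_a, norm_b = char_ngrams(normalize(variant_a)), char_ngrams(normalize(variant_b))

# Before normalization the variants are distinguishable; afterwards they are not.
distinct_before = raw_a != raw_b
distinct_after = norm_a != norm_b
print(distinct_before, distinct_after)
```

If a spelling habit of this kind correlates with a period or a scribe, a classifier trained on the normalized text can no longer use it, which is one plausible reading of why dating accuracy might drop.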
Taxonomy
Topics: Medieval Literature and History · Translation Studies and Practices · Medieval Iberian Studies
Methods: Sparse Evolutionary Training
