Historical German Text Normalization Using Type- and Token-Based Language Modeling
Anton Ehrmanntraut

TL;DR
This paper presents a Transformer-based system that normalizes the historical spelling of German literary texts from c. 1700-1900 to modern orthography. It combines an encoder-decoder model that normalizes individual word types with a pre-trained causal language model that adjusts those normalizations in context.
Contribution
It introduces a novel hybrid Transformer approach that combines word-level normalization with contextual adjustment, achieving state-of-the-art accuracy on historical German texts.
Findings
State-of-the-art normalization accuracy achieved
Comparable performance to larger end-to-end systems
Challenges remain due to limited parallel data and model generalization
Abstract
Historic variations of spelling pose a challenge for full-text search or natural language processing on historical digitized texts. To minimize the gap between the historic orthography and contemporary spelling, usually an automatic orthographic normalization of the historical source material is pursued. This report proposes a normalization system for German literary texts from c. 1700-1900, trained on a parallel corpus. The proposed system makes use of a machine learning approach using Transformer language models, combining an encoder-decoder model to normalize individual word types, and a pre-trained causal language model to adjust these normalizations within their context. An extensive evaluation shows that the proposed system provides state-of-the-art accuracy, comparable with a much larger fully end-to-end sentence-based normalization system, fine-tuning a pre-trained Transformer…
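The two-stage idea in the abstract (type-level candidates, then contextual adjustment) can be illustrated with a small toy sketch. This is not the paper's implementation: the candidate table `TYPE_CANDIDATES` stands in for a trained encoder-decoder model, the bigram table `BIGRAM_LOGP` stands in for a pre-trained causal language model, and all words, probabilities, and function names are hypothetical values chosen for illustration.

```python
import math

# Hypothetical type-level candidates: historical spelling -> [(modern form, log-prob)].
# A real system would obtain these from a trained encoder-decoder model.
TYPE_CANDIDATES = {
    "Thür": [("Tür", math.log(1.0))],
    "wider": [("wider", math.log(0.6)), ("wieder", math.log(0.4))],
}

# Toy bigram log-probabilities standing in for a causal language model.
BIGRAM_LOGP = {
    ("kam", "wieder"): math.log(0.5),
    ("wider", "den"): math.log(0.5),
}
UNSEEN_LOGP = math.log(1e-4)  # assumed back-off score for unseen bigrams


def candidates(token):
    """Return type-level candidates; unknown types pass through unchanged."""
    return TYPE_CANDIDATES.get(token, [(token, 0.0)])


def bigram_logp(prev, word):
    """Score `word` given the previous normalized word."""
    return BIGRAM_LOGP.get((prev, word), UNSEEN_LOGP)


def normalize(tokens):
    """Pick the candidate sequence maximizing type score plus context score (Viterbi)."""
    # best maps the last chosen word to (total score, path so far).
    best = {"<s>": (0.0, [])}
    for token in tokens:
        new_best = {}
        for cand, type_lp in candidates(token):
            score, path = max(
                ((s + bigram_logp(prev, cand) + type_lp, p)
                 for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[cand] = (score, path + [cand])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


print(normalize(["er", "kam", "wider"]))
print(normalize(["wider", "den", "Feind"]))
```

In this toy setup the type-level table alone would always prefer "wider", while the context score lets the same historical spelling resolve to "wieder" after "kam" and stay "wider" before "den". A bigram table is used only to keep the sketch self-contained; the abstract describes a pre-trained causal language model in that role.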
Taxonomy
Topics: Natural Language Processing Techniques · Mathematics, Computing, and Information Processing
Methods: Byte Pair Encoding · Absolute Position Encodings · Softmax · Label Smoothing · Linear Layer · Adam · Dropout · Layer Normalization · Dense Connections · Attention Is All You Need
