Noise-Aware Named Entity Recognition for Historical VET Documents
Alexander M. Esser, Jens Dörpinghaus

TL;DR
This paper introduces a noise-aware NER method for historical VET documents that improves recognition accuracy in noisy OCR conditions using synthetic errors, transfer learning, and multi-stage fine-tuning.
Contribution
It presents a novel noise-aware training approach for NER in noisy, historical domain-specific documents, including multiple entity types and transferability to other languages.
Findings
Significant accuracy improvement with noise-aware fine-tuning
Effective recognition of multiple entity types in VET documents
Method applicable to various languages beyond German
Abstract
This paper addresses Named Entity Recognition (NER) in the domain of Vocational Education and Training (VET), focusing on historical, digitized documents that suffer from OCR-induced noise. We propose a robust NER approach leveraging Noise-Aware Training (NAT) with synthetically injected OCR errors, transfer learning, and multi-stage fine-tuning. Three complementary strategies, training on noisy, clean, and artificial data, are systematically compared. Our method is one of the first to recognize multiple entity types in VET documents. It is applied to German documents but transferable to arbitrary languages. Experimental results demonstrate that domain-specific and noise-aware fine-tuning substantially increases robustness and accuracy under noisy conditions. We provide publicly available code for reproducible noise-aware NER in domain-specific contexts.
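The idea of injecting synthetic OCR errors into labelled training data can be illustrated with a minimal sketch. This is not the authors' published code: the confusion table, the function names `corrupt_token` and `inject_ocr_noise`, the `error_rate` parameter, and the example tags are all assumptions made here for illustration. The sketch corrupts characters inside each token, so the number of tokens and therefore the token-level labels stay aligned.

```python
import random

# Hypothetical OCR confusion pairs (illustrative only, not taken from the paper).
OCR_CONFUSIONS = {
    "e": ["c"],
    "l": ["1", "i"],
    "i": ["l", "1"],
    "o": ["0"],
    "s": ["f"],
    "u": ["n"],
    "n": ["u"],
    "m": ["rn"],
    "ü": ["u", "ii"],
    "ä": ["a"],
    "ö": ["o"],
    "ß": ["B"],
}


def corrupt_token(token, error_rate, rng):
    """Replace each confusable character with probability error_rate."""
    out = []
    for ch in token:
        candidates = OCR_CONFUSIONS.get(ch)
        if candidates and rng.random() < error_rate:
            out.append(rng.choice(candidates))
        else:
            out.append(ch)
    return "".join(out)


def inject_ocr_noise(tokens, labels, error_rate=0.1, seed=None):
    """Return a noisy copy of tokens; labels are copied unchanged, one per token."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    rng = random.Random(seed)
    noisy = [corrupt_token(t, error_rate, rng) for t in tokens]
    return noisy, list(labels)


if __name__ == "__main__":
    # Example sentence with hypothetical BIO tags (tag names are assumptions).
    tokens = ["Ausbildung", "zum", "Tischler", "in", "Bonn"]
    labels = ["O", "O", "B-OCC", "O", "B-LOC"]
    noisy_tokens, noisy_labels = inject_ocr_noise(tokens, labels, error_rate=0.3, seed=42)
    for tok, lab in zip(noisy_tokens, noisy_labels):
        print(tok, lab)
```

Noisy copies produced this way could be mixed with the clean sentences when fine-tuning a tagger. A realistic setup would presumably estimate the confusion table and error rate from real OCR output rather than hard-coding them.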
Taxonomy
Topics: Topic Modeling · Text Readability and Simplification · Text and Document Classification Technologies
