Resource-Size matters: Improving Neural Named Entity Recognition with Optimized Large Corpora
Sajawel Ahmed, Alexander Mehler

TL;DR
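This paper improves neural named entity recognition for German, treated here as a low-resource language, by optimizing large corpora and applying morphological processing (lemmatization and part-of-speech tagging) before training. It reports gains of up to 11% in F-score and new state-of-the-art results on each open-source dataset tested.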
Contribution
It introduces a resource optimization approach with morphological preprocessing that outperforms existing baselines without resorting to deeper or wider neural architectures.
Findings
Up to 11% F-score improvement on German NER
Downstream NER performance depends on the size of the corpora used to compute word embeddings
Morphological processing enhances data quality for NER
Abstract
This study improves the performance of neural named entity recognition by a margin of up to 11% in F-score on the example of a low-resource language like German, thereby outperforming existing baselines and establishing a new state-of-the-art on each single open-source dataset. Rather than designing deeper and wider hybrid neural architectures, we gather all available resources and perform a detailed optimization and grammar-dependent morphological processing consisting of lemmatization and part-of-speech tagging prior to exposing the raw data to any training process. We test our approach in a threefold monolingual experimental setup of a) single, b) joint, and c) optimized training and shed light on the dependency of downstream-tasks on the size of corpora used to compute word embeddings.
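The preprocessing step described in the abstract (lemmatization and part-of-speech tagging applied before any training) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy lexicon `TOY_LEXICON`, the helper `annotate`, the tag names, and the `lemma_POS` token format are hypothetical and are not taken from the paper, which may use different tools and a different representation.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon mapping a surface form to (lemma, POS tag).
# A real pipeline would use a trained lemmatizer and tagger instead.
TOY_LEXICON: Dict[str, Tuple[str, str]] = {
    "Die": ("die", "DET"),
    "Häuser": ("Haus", "NOUN"),
    "stehen": ("stehen", "VERB"),
    "in": ("in", "ADP"),
    "Frankfurt": ("Frankfurt", "PROPN"),
}


def annotate(tokens: List[str]) -> List[str]:
    """Rewrite each token as 'lemma_POS'.

    Tokens missing from the lexicon keep their surface form
    and receive the placeholder tag 'X'.
    """
    annotated = []
    for token in tokens:
        lemma, pos = TOY_LEXICON.get(token, (token, "X"))
        annotated.append(f"{lemma}_{pos}")
    return annotated


sentence = ["Die", "Häuser", "stehen", "in", "Frankfurt"]
processed = annotate(sentence)
print(" ".join(processed))
```

In this sketch the inflected plural "Häuser" and its lemma "Haus" collapse onto one vocabulary entry, which is one plausible way such preprocessing could reduce sparsity in the corpus used to compute word embeddings.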
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
