An Investigation of Noise in Morphological Inflection
Adam Wiemerslage, Changbing Yang, Garrett Nicolai, Miikka Silfverberg, and Katharina Kann

TL;DR
This paper investigates how noise in training data affects morphological inflection systems. It proposes an error taxonomy, analyzes the effect of different noise types on several models, and introduces a character-level masked language modeling (CMLM) pretraining objective to improve noise robustness.
Contribution
It introduces an error taxonomy and annotation pipeline for morphological inflection training data, compares the effect of different noise types across state-of-the-art inflection models, and proposes a novel CMLM pretraining objective to improve resistance to noise.
Findings
Encoder-decoder models tend to be more robust to noise than copy-biased models.
CMLM pretraining improves transformer robustness to noise.
Different architectures are affected differently by different types of noise.
Abstract
With a growing focus on morphological inflection systems for languages where high-quality data is scarce, training data noise is a serious but so far largely ignored concern. We aim at closing this gap by investigating the types of noise encountered within a pipeline for truly unsupervised morphological paradigm completion and its impact on morphological inflection systems: First, we propose an error taxonomy and annotation pipeline for inflection training data. Then, we compare the effect of different types of noise on multiple state-of-the-art inflection models. Finally, we propose a novel character-level masked language modeling (CMLM) pretraining objective and explore its impact on the models' resistance to noise. Our experiments show that various architectures are impacted differently by separate types of noise, but encoder-decoders tend to be more robust to noise than models…
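The abstract names a character-level masked language modeling (CMLM) objective but does not give its details here. As a rough illustration of the general idea only, the sketch below masks individual characters of a word and records the originals as prediction targets. The mask symbol, the 15% masking rate, the at-least-one-mask rule, and the function name `mask_characters` are all assumptions made for this example, not the authors' settings.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, not taken from the paper


def mask_characters(word, mask_prob=0.15, rng=None):
    """Build one character-level masked-LM example from a word.

    Returns (masked, targets): `masked` is the character sequence with some
    positions replaced by MASK; `targets[i]` is the original character at
    each masked position and None elsewhere.
    """
    if rng is None:
        rng = random.Random()
    chars = list(word)
    masked, targets = [], []
    for ch in chars:
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(ch)
        else:
            masked.append(ch)
            targets.append(None)
    # Assumption: force at least one masked position per non-empty word,
    # so that every example contributes a prediction target.
    if chars and all(t is None for t in targets):
        i = rng.randrange(len(chars))
        targets[i] = chars[i]
        masked[i] = MASK
    return masked, targets


if __name__ == "__main__":
    masked, targets = mask_characters("walking", rng=random.Random(0))
    print(masked)
    print(targets)
```

A model pretrained this way would be trained to predict each non-None entry of `targets` from the masked sequence; the loss and model architecture are left out of this sketch.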
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text and Document Classification Technologies
