Corpus Frequencies in Morphological Inflection: Do They Matter?
Tomáš Sourada, Jana Straková

TL;DR
This paper investigates the impact of incorporating corpus frequency information into morphological inflection tasks, proposing methods that improve model performance by reflecting real-world word usage distributions.
Contribution
It introduces frequency-aware training, combines lemma-disjoint splits with frequency weighting, and compares token versus type accuracy for more realistic evaluation.
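As a rough illustration of what frequency-aware training could look like, the sketch below draws training items with probability proportional to their corpus counts instead of uniformly. This summary does not state the paper's exact sampling scheme, so the proportional-to-raw-count weighting, the function name `sample_training_batch`, and the toy counts are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple

# A lexical entry: (lemma, morphological tag, inflected form).
Triple = Tuple[str, str, str]


def sample_training_batch(
    items: Sequence[Triple],
    counts: Sequence[int],
    batch_size: int,
    rng: random.Random,
) -> List[Triple]:
    """Sample a batch with replacement, weighting each item by its corpus count.

    Assumption: weights are raw counts. A real system might smooth them or
    mix in uniform sampling; that choice is not specified in this summary.
    """
    if len(items) != len(counts):
        raise ValueError("items and counts must have the same length")
    return rng.choices(items, weights=counts, k=batch_size)


# Toy data (hypothetical counts, not taken from the paper).
toy_items = [
    ("go", "V;PST", "went"),
    ("walk", "V;PST", "walked"),
    ("stride", "V;PST", "strode"),
]
toy_counts = [90, 9, 1]

batch = sample_training_batch(toy_items, toy_counts, 8, random.Random(0))
```

Uniform sampling corresponds to passing equal counts for every item, which makes the two regimes easy to compare within the same training loop.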
Findings
Frequency-aware training outperforms uniform sampling in 26 of 43 languages.
Token accuracy better reflects real-world performance on frequent words.
Incorporating frequency information improves generalization and evaluation metrics.
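The difference between type and token accuracy can be made concrete with a small sketch. Type accuracy counts every distinct test item once, while token accuracy weights each item by how often it occurs in a corpus. The function names and the toy frequencies below are hypothetical, and the paper's exact metric definitions may differ in detail.

```python
from typing import Dict, Tuple

# A test item is identified by (lemma, morphological tag).
Key = Tuple[str, str]


def type_accuracy(gold: Dict[Key, str], pred: Dict[Key, str]) -> float:
    """Fraction of distinct items whose predicted form matches the gold form."""
    correct = sum(1 for key, form in gold.items() if pred.get(key) == form)
    return correct / len(gold)


def token_accuracy(
    gold: Dict[Key, str], pred: Dict[Key, str], freq: Dict[Key, int]
) -> float:
    """Like type accuracy, but each item is weighted by its corpus count."""
    total = sum(freq.get(key, 0) for key in gold)
    correct = sum(
        freq.get(key, 0) for key, form in gold.items() if pred.get(key) == form
    )
    return correct / total


# Toy example (hypothetical counts): the model fails only on the rare item.
gold = {
    ("go", "V;PST"): "went",
    ("walk", "V;PST"): "walked",
    ("stride", "V;PST"): "strode",
}
pred = {
    ("go", "V;PST"): "went",
    ("walk", "V;PST"): "walked",
    ("stride", "V;PST"): "strided",
}
freq = {("go", "V;PST"): 90, ("walk", "V;PST"): 9, ("stride", "V;PST"): 1}

type_acc = type_accuracy(gold, pred)          # 2 of 3 items correct
token_acc = token_accuracy(gold, pred, freq)  # 99 of 100 occurrences correct
```

In this toy case a single error on a rare form costs a third of the type accuracy but only one percent of the token accuracy, which is why the two metrics can rank systems differently.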
Abstract
The traditional approach to morphological inflection (the task of modifying a base word (lemma) to express grammatical categories) has been, for decades, to consider lexical entries of lemma-tag-form triples uniformly, lacking any information about their frequency distribution. However, in production deployment, one might expect the user inputs to reflect a real-world distribution of frequencies in natural texts. With future deployment in mind, we explore the incorporation of corpus frequency information into the task of morphological inflection along three key dimensions during system development: (i) for train-dev-test split, we combine a lemma-disjoint approach, which evaluates the model's generalization capabilities, with a frequency-weighted strategy to better reflect the realistic distribution of items across different frequency bands in training and test sets; (ii) for…
