Estimating Lexical Priors for Low-Frequency Syncretic Forms
Harald Baayen (Max Planck Institute for Psycholinguistics), Richard Sproat (AT&T Bell Laboratories)

TL;DR
This paper argues that the best estimate of lexical priors for a previously unseen, morphologically ambiguous form comes from the relative frequencies of its possible functions among the hapax legomena, the forms that occur exactly once in a corpus.
Contribution
It introduces an estimator for the lexical priors of low-frequency ambiguous forms that is based on relative frequencies among hapax legomena, and it draws out the consequences for hand-tagging corpora when building stochastic morphological taggers.
Findings
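Relative frequencies among hapax legomena provide the best estimates of lexical priors for previously unseen, morphologically ambiguous forms.
This matters most for very low-frequency ambiguous types, most of which would not occur in any given corpus.
When initial hand-tagging is required, tagging a good representative sample of the hapax legomena is preferable to extensively tagging words of all frequency ranges.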
Abstract
Given a previously unseen form that is morphologically n-ways ambiguous, what is the best estimator for the lexical prior probabilities for the various functions of the form? We argue that the best estimator is provided by computing the relative frequencies of the various functions among the hapax legomena, the forms that occur exactly once in a corpus. This result has important implications for the development of stochastic morphological taggers, especially when some initial hand-tagging of a corpus is required: for predicting lexical priors for very low-frequency morphologically ambiguous types (most of which would not occur in any given corpus) one should concentrate on tagging a good representative sample of the hapax legomena, rather than extensively tagging words of all frequency ranges.
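The estimator described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `hapax_priors`, the `(form, function)` input format, the toy corpus, and the labels `INF` and `PL` are all assumptions made here for the example.

```python
from collections import Counter


def hapax_priors(tagged_tokens):
    """Estimate lexical priors from hapax legomena (illustrative sketch).

    tagged_tokens: iterable of (form, function) pairs, one per corpus token.
    Returns a dict mapping each function to its relative frequency among
    the forms that occur exactly once in the corpus.
    """
    tokens = list(tagged_tokens)
    # Count how often each surface form occurs in the corpus.
    form_counts = Counter(form for form, _ in tokens)
    # Keep only the functions of forms that occur exactly once.
    hapax_functions = Counter(
        function for form, function in tokens if form_counts[form] == 1
    )
    total = sum(hapax_functions.values())
    if total == 0:
        return {}  # no hapax legomena, so no estimate
    return {function: n / total for function, n in hapax_functions.items()}


# Hypothetical toy corpus: one frequent form and four hapax legomena.
corpus = [
    ("lopen", "INF"), ("lopen", "PL"), ("lopen", "INF"),
    ("zwemmen", "INF"),
    ("fietsen", "PL"), ("dansen", "PL"), ("kaarten", "PL"),
]
priors = hapax_priors(corpus)
```

In this toy corpus the frequent form `lopen` is ignored, so the estimate rests only on the four forms seen once: one `INF` and three `PL`. Counting all seven tokens instead would give different proportions, which is the contrast the abstract draws between hapax-based and all-frequency estimates.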
