Probabilistic Method of Measuring Linguistic Productivity
Sergei Monakhov

TL;DR
This paper introduces a probabilistic method for measuring linguistic productivity that assesses an affix's ability to form new words independently of token frequency, using corpus-based simulation and evaluation on English and Russian data.
Contribution
It proposes a novel, corpus-based probabilistic approach to measure linguistic productivity that accounts for neologisms and is not biased by token frequency.
Findings
Productivity correlates with the number of word types.
High-frequency items increase first, followed by low-frequency items.
The method provides new insights into linguistic productivity dynamics.
Abstract
In this paper I propose a new way of measuring linguistic productivity that objectively assesses the ability of an affix to be used to coin new complex words and, unlike other popular measures, is not directly dependent upon token frequency. Specifically, I suggest that linguistic productivity may be viewed as the probability that an affix combines with a random base. The advantages of this approach include the following. First, token frequency does not dominate the productivity measure but naturally influences the sampling of bases. Second, we are not just counting attested word types with an affix but rather simulating the construction of these types and then checking whether they are attested in the corpus. Third, a corpus-based approach and randomised design ensure that true neologisms and words coined long ago have an equal chance of being selected. The proposed algorithm is evaluated…
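The idea in the abstract, productivity as the probability that an affix combines with a randomly drawn base, can be illustrated with a minimal Monte Carlo sketch. This is not the paper's algorithm, whose details are not given here. The function name `estimate_productivity`, the toy counts, and the naive string concatenation used to build candidate words are all assumptions made for illustration. Bases are drawn in proportion to their token frequency, so frequency shapes the sampling without entering the score directly.

```python
import random
from collections import Counter


def estimate_productivity(affix, base_counts, attested_types,
                          n_samples=10_000, seed=0):
    """Estimate P(base + affix is attested) for a randomly sampled base.

    Hypothetical helper: bases are sampled with probability proportional
    to their token counts, a candidate word is built by plain suffixation
    (an assumption; real morphology needs more care), and the candidate
    is looked up in the set of attested word types.
    """
    rng = random.Random(seed)
    bases = list(base_counts)
    weights = [base_counts[b] for b in bases]
    # Token frequency influences only which bases get drawn.
    sampled = rng.choices(bases, weights=weights, k=n_samples)
    hits = sum((base + affix) in attested_types for base in sampled)
    return hits / n_samples


# Toy corpus statistics (invented for the example, not from the paper).
base_counts = Counter({"kind": 50, "dark": 30, "table": 40, "run": 80})
attested = {"kindness", "darkness"}

p = estimate_productivity("ness", base_counts, attested,
                          n_samples=5000, seed=1)
print(f"estimated productivity of -ness: {p:.3f}")
```

In this toy setup the estimate should land near 0.4, since the two bases with attested derivatives account for 80 of the 200 base tokens. A fuller implementation would presumably need proper morphological segmentation and a real corpus lookup instead of string concatenation against a small set.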
Taxonomy
Topics: Natural Language Processing Techniques · Authorship Attribution and Profiling · Linguistics, Language Diversity, and Identity
