Non-Zipfian Distribution of Stopwords and Subset Selection Models
Wentian Li, Oscar Fontanelli

TL;DR
This paper investigates the distribution of stopwords and non-stopwords, revealing that stopwords follow a Beta Rank Function while non-stopwords deviate from Zipf's law, and proposes a subset selection model based on rank-dependent probabilities.
Contribution
It introduces a novel subset selection model using Hill's functions to explain the distribution of stopwords and non-stopwords, supported by analytical and empirical validation.
Findings
Stopwords follow a Beta Rank Function distribution.
Non-stopwords also deviate from Zipf's law; their log-token-count is fitted better by a quadratic function of log-rank than by the Beta Rank Function.
The proposed model explains the observed rank-frequency distributions.
Abstract
Stopwords are words that carry little information about the content or meaning of a text. Most stopwords are function words, but they can also be common verbs, adjectives, and adverbs. In contrast to the well-known Zipf's law for the rank-frequency plot of all words, the rank-frequency plot for stopwords is best fitted by the Beta Rank Function (BRF). The rank-frequency plots of non-stopwords also deviate from Zipf's law, but are fitted better by a quadratic function of log-token-count over log-rank than by the BRF. Based on the observed rank of stopwords in the full word list, we propose a stopword (subset) selection model in which the probability of being selected, as a function of the word's rank, is a decreasing Hill's function, whereas the probability of not being selected is the standard Hill's function (…
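The selection model described in the abstract can be illustrated with a small simulation. The formulas are not shown in this excerpt, so the sketch assumes the standard Hill function r^γ / (r_mid^γ + r^γ) and takes its complement as the decreasing version; the names `r_mid`, `gamma`, and `select_subset` are hypothetical and chosen only for illustration.

```python
import random


def hill(rank, r_mid, gamma):
    """Standard (increasing) Hill function; equals 0.5 at rank == r_mid."""
    return rank ** gamma / (r_mid ** gamma + rank ** gamma)


def decreasing_hill(rank, r_mid, gamma):
    """Complement of the Hill function: 1 / (1 + (rank / r_mid)**gamma)."""
    return 1.0 - hill(rank, r_mid, gamma)


def select_subset(n_ranks, r_mid, gamma, seed=0):
    """Split ranks 1..n_ranks into a selected subset and the remainder.

    Each rank is selected independently with probability
    decreasing_hill(rank), so low ranks (frequent words) are favored.
    """
    rng = random.Random(seed)
    selected, rest = [], []
    for rank in range(1, n_ranks + 1):
        if rng.random() < decreasing_hill(rank, r_mid, gamma):
            selected.append(rank)
        else:
            rest.append(rank)
    return selected, rest


if __name__ == "__main__":
    stop_ranks, other_ranks = select_subset(n_ranks=1000, r_mid=50, gamma=2.0)
    print(len(stop_ranks), len(other_ranks))
```

Because the two probabilities sum to one at every rank, each word lands in exactly one of the two lists, and re-ranking each list by itself gives the two rank-frequency curves the abstract compares.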
Taxonomy
Topics: Authorship Attribution and Profiling · Language and cultural evolution · Complex Network Analysis Techniques
