Non-Zipfian Distribution of Stopwords and Subset Selection Models

Wentian Li; Oscar Fontanelli

arXiv:2603.04691·cs.CL·March 6, 2026

Open Access

TL;DR

This paper investigates the distribution of stopwords and non-stopwords, revealing that stopwords follow a Beta Rank Function while non-stopwords deviate from Zipf's law, and proposes a subset selection model based on rank-dependent probabilities.

Contribution

It introduces a novel subset selection model using Hill's functions to explain the distribution of stopwords and non-stopwords, supported by analytical and empirical validation.

Findings

01

Stopwords follow a Beta Rank Function distribution.

02

Non-stopwords are better fitted by a quadratic function of log-token-count over log-rank than by the Beta Rank Function.

03

The proposed model explains the observed rank-frequency distributions.

Abstract

Stopwords are words that are not very informative about the content or meaning of a text. Most stopwords are function words, but they can also be common verbs, adjectives and adverbs. In contrast to the well-known Zipf's law for the rank-frequency plot of all words, the rank-frequency plot of stopwords is best fitted by the Beta Rank Function (BRF). The rank-frequency plots of non-stopwords also deviate from Zipf's law, but are fitted better by a quadratic function of log-token-count over log-rank than by the BRF. Based on the observed ranks of stopwords in the full word list, we propose a stopword (subset) selection model in which the probability of a word being selected, as a function of its rank $r$, is a decreasing Hill's function ($1/(1 + (r/r_{mid})^{\gamma})$), whereas the probability of not being selected is the standard Hill's function (…
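
The selection rule described in the abstract can be illustrated with a minimal Python sketch. It evaluates the decreasing Hill's function $1/(1 + (r/r_{mid})^{\gamma})$ at each rank and draws an independent random selection per rank. The function names, the independence of the draws, and the parameter values (`n_words=5000`, `r_mid=100.0`, `gamma=1.5`) are assumptions chosen for illustration and are not taken from the paper.

```python
import random


def select_prob(rank, r_mid, gamma):
    """Decreasing Hill's function: 1 / (1 + (rank / r_mid) ** gamma)."""
    return 1.0 / (1.0 + (rank / r_mid) ** gamma)


def split_by_rank(n_words, r_mid, gamma, seed=0):
    """Assign each rank 1..n_words to the selected or unselected subset.

    Each rank is selected independently with probability select_prob(rank);
    it is left out with the complementary probability 1 - select_prob(rank).
    Independent draws are an assumption of this sketch.
    """
    rng = random.Random(seed)
    selected, rest = [], []
    for rank in range(1, n_words + 1):
        if rng.random() < select_prob(rank, r_mid, gamma):
            selected.append(rank)
        else:
            rest.append(rank)
    return selected, rest


# Hypothetical parameter values, for illustration only.
selected, rest = split_by_rank(n_words=5000, r_mid=100.0, gamma=1.5)
print(len(selected), "ranks selected;", len(rest), "ranks not selected")
```

In this sketch the selection probability equals 1/2 exactly at `rank == r_mid`, and `gamma` controls how sharply the probability falls from near 1 at low ranks to near 0 at high ranks.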

Taxonomy

Topics: Authorship Attribution and Profiling · Language and cultural evolution · Complex Network Analysis Techniques