Generalized Entropies and the Similarity of Texts
Eduardo G. Altmann, Laercio Dias, and Martin Gerlach

TL;DR
This paper shows how generalized entropies and divergences computed from word-frequency distributions can be used to measure the similarity of texts and to identify which words contribute most to it, with tests on large corpora of books and scientific papers.
Contribution
The paper demonstrates that, because word frequencies follow Zipf's law, both generalized entropies and generalized Jensen-Shannon divergences are dominated by words within a specific frequency range, and it uses this result to analyze text similarity.
Findings
Generalized entropies computed at the word level are dominated by words in a specific range of frequencies.
Generalized Jensen-Shannon divergences show the same behavior, which makes it possible to identify the contribution of specific words to the similarity between texts.
The paper estimates the database size needed to measure the divergences reliably.
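To make the first finding concrete, here is a minimal sketch rather than the authors' code. It assumes an idealized Zipf distribution p_r proportional to r^(-gamma) and looks at the sum of p_i^alpha, the quantity that appears in Tsallis-type generalized entropies (the exact entropy family is an assumption here, since this summary does not state it). The function names, the exponent gamma = 1, and the vocabulary size are all hypothetical choices for illustration.

```python
def zipf_distribution(n_words, gamma=1.0):
    """Idealized Zipf distribution: p_r proportional to r**(-gamma), r = 1..n_words."""
    weights = [r ** -gamma for r in range(1, n_words + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def top_rank_share(p, alpha, k):
    """Fraction of sum_i p_i**alpha contributed by the k most frequent words.

    Assumes p is sorted from most to least frequent.
    """
    terms = [x ** alpha for x in p]
    return sum(terms[:k]) / sum(terms)


if __name__ == "__main__":
    p = zipf_distribution(100_000)
    # Larger alpha weights frequent words more heavily,
    # smaller alpha shifts the weight toward rare words.
    for alpha in (0.5, 1.5, 2.0):
        share = top_rank_share(p, alpha, k=100)
        print(f"alpha={alpha}: top-100 words carry {share:.3f} of the sum")
```

Under these assumptions, the share carried by the most frequent words grows with alpha, which is one way to see why different orders of the entropy probe different frequency ranges.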
Abstract
We show how generalized Gibbs-Shannon entropies can provide new insights into the statistical properties of texts. The universal distribution of word frequencies (Zipf's law) implies that the generalized entropies, computed at the word level, are dominated by words in a specific range of frequencies. Here we show that this is the case not only for the generalized entropies but also for the generalized (Jensen-Shannon) divergences, used to compute the similarity between different texts. This finding allows us to identify the contribution of specific words (and word frequencies) to the different generalized entropies and also to estimate the size of the databases needed to obtain a reliable estimation of the divergences. We test our results in large databases of books (from the Google n-gram database) and scientific papers (indexed by Web of Science).
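The divergence described in the abstract can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: it assumes the Tsallis form H_alpha(p) = (sum_i p_i^alpha - 1) / (1 - alpha), which reduces to the Shannon entropy at alpha = 1, and the generalized Jensen-Shannon divergence D_alpha(p, q) = H_alpha((p + q) / 2) - H_alpha(p) / 2 - H_alpha(q) / 2. Because the probabilities sum to one, H_alpha can be written as a sum of per-word terms, so the divergence splits into one contribution per word. All function names and the toy texts are hypothetical.

```python
import math
from collections import Counter


def frequencies(tokens, vocab):
    """Relative frequency of each vocabulary word in a token list."""
    counts = Counter(tokens)
    n = len(tokens)
    return [counts[w] / n for w in vocab]


def entropy_term(x, alpha):
    """Per-word term of the assumed Tsallis entropy (alpha > 0).

    Summing this over all words gives H_alpha, since sum_i p_i = 1.
    """
    if x == 0.0:
        return 0.0
    if alpha == 1.0:
        return -x * math.log(x)  # Shannon limit
    return (x ** alpha - x) / (1.0 - alpha)


def word_contributions(p, q, alpha):
    """Contribution of each word to the generalized Jensen-Shannon divergence."""
    out = []
    for pi, qi in zip(p, q):
        mi = 0.5 * (pi + qi)
        out.append(entropy_term(mi, alpha)
                   - 0.5 * entropy_term(pi, alpha)
                   - 0.5 * entropy_term(qi, alpha))
    return out


def js_divergence(p, q, alpha):
    """Generalized Jensen-Shannon divergence as the sum of per-word contributions."""
    return sum(word_contributions(p, q, alpha))


if __name__ == "__main__":
    text_a = "the cat sat on the mat".split()
    text_b = "the dog sat on the log".split()
    vocab = sorted(set(text_a) | set(text_b))
    p, q = frequencies(text_a, vocab), frequencies(text_b, vocab)
    for alpha in (0.5, 1.0, 2.0):
        ranked = sorted(zip(word_contributions(p, q, alpha), vocab), reverse=True)
        print(alpha, js_divergence(p, q, alpha), [w for _, w in ranked[:3]])
```

Words with the same relative frequency in both texts contribute nothing, so ranking the per-word terms gives a simple way to see which words drive the divergence at a given alpha.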
