Variation of word frequencies in Russian literary texts
Vladislav Kargin

TL;DR
This paper investigates how word frequency variability in Russian literature relates to average frequency, revealing a power-law relationship and exploring latent factors influencing this distribution.
Contribution
It introduces a power-law model for word frequency volatility and analyzes latent factors to explain the distribution structure in Russian texts.
Findings
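The standard deviation of a word's frequency across texts follows a power law in the word's average frequency, with exponent 0.62.
Rarer words exhibit relatively higher frequency volatility ("burstiness").
Asymmetry in the distribution of latent factors can explain how frequency volatility depends on average frequency.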
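A short worked step may help connect the exponent to the claim about rarer words. It assumes the power law has the form below, where the symbols \sigma (standard deviation across texts), \mu (average frequency), and the constant c are notation introduced here for illustration, not taken from the paper.

```latex
\sigma(\mu) = c\,\mu^{\beta}, \qquad \beta \approx 0.62
\quad\Longrightarrow\quad
\frac{\sigma(\mu)}{\mu} = c\,\mu^{\beta - 1} \approx c\,\mu^{-0.38}
```

Because \beta - 1 < 0, the relative volatility \sigma/\mu grows as \mu falls. Under this form, a word that is 100 times rarer would have relative volatility larger by a factor of about 100^{0.38}, roughly 5.8.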
Abstract
We study the variation of word frequencies in Russian literary texts. Our findings indicate that the standard deviation of a word's frequency across texts depends on its average frequency according to a power law with exponent 0.62, showing that rarer words have a relatively larger degree of frequency volatility (i.e., "burstiness"). Several latent factor models have been estimated to investigate the structure of the word frequency distribution. The dependence of a word's frequency volatility on its average frequency can be explained by the asymmetry in the distribution of latent factors.
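The power-law dependence described above can be illustrated with a minimal sketch that estimates the exponent by least squares on log-transformed per-word means and standard deviations. This is not the paper's estimation procedure or corpus: the function names, the gamma-distributed synthetic frequencies, and the constant c = 0.01 are all assumptions chosen only so the example is self-contained.

```python
import numpy as np


def fit_power_law_exponent(freqs):
    """Estimate beta in std ~ c * mean**beta from a (texts x words) matrix.

    `freqs[i, j]` is the relative frequency of word j in text i.
    Hypothetical helper: a plain log-log least-squares fit.
    """
    mean = freqs.mean(axis=0)
    std = freqs.std(axis=0, ddof=1)
    # Keep only words where both logarithms are defined.
    keep = (mean > 0) & (std > 0)
    slope, intercept = np.polyfit(np.log(mean[keep]), np.log(std[keep]), 1)
    return slope, np.exp(intercept)


def synthetic_frequencies(n_texts=2000, n_words=500, beta=0.62, c=0.01, seed=0):
    """Generate synthetic data (an assumption, not real corpus data).

    Each word gets a mean frequency mu and a target std of c * mu**beta;
    per-text frequencies are drawn from a gamma distribution with those moments.
    """
    rng = np.random.default_rng(seed)
    mu = 10.0 ** rng.uniform(-6, -2, size=n_words)  # rare to common words
    sigma = c * mu**beta
    shape = (mu / sigma) ** 2   # gamma shape k gives mean k * theta = mu
    scale = sigma**2 / mu       # gamma scale theta gives std sqrt(k) * theta = sigma
    return rng.gamma(shape, scale, size=(n_texts, n_words))


freqs = synthetic_frequencies()
beta_hat, c_hat = fit_power_law_exponent(freqs)
print(f"estimated exponent: {beta_hat:.3f}")
```

On real data, `freqs` would be replaced by a matrix of observed relative word frequencies per text; the fit itself does not depend on how the matrix was produced.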
