Variation of word frequencies in Russian literary texts

Vladislav Kargin

arXiv:1503.00339·cs.CL·January 20, 2016

TL;DR

This paper investigates how word frequency variability in Russian literature relates to average frequency, revealing a power-law relationship and exploring latent factors influencing this distribution.

Contribution

It introduces a power-law model for word frequency volatility and analyzes latent factors to explain the distribution structure in Russian texts.

Findings

01

The standard deviation of a word's frequency across texts scales with its average frequency as a power law with exponent 0.62.

02

Rarer words exhibit higher frequency volatility relative to their average frequency (i.e., they are "burstier").

03

Asymmetry in the distribution of latent factors can explain how a word's frequency volatility depends on its average frequency.
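
The link between the exponent and the greater relative volatility of rare words can be made explicit with a short calculation. The symbols here are assumptions introduced for illustration and do not come from the paper: $\mu$ is a word's average frequency across texts, $\sigma$ is the standard deviation of that frequency, and $c$ is an unspecified positive constant.

$$
\sigma = c\,\mu^{0.62}
\quad\Longrightarrow\quad
\frac{\sigma}{\mu} = c\,\mu^{0.62 - 1} = c\,\mu^{-0.38}.
$$

Because the exponent $-0.38$ is negative, the ratio $\sigma/\mu$ grows as $\mu$ shrinks. Under this reading, a word that is 100 times rarer would have a relative volatility larger by a factor of $100^{0.38} \approx 5.8$.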

Abstract

We study the variation of word frequencies in Russian literary texts. Our findings indicate that the standard deviation of a word's frequency across texts depends on its average frequency according to a power law with exponent $0.62$, showing that rarer words have a relatively larger degree of frequency volatility (i.e., "burstiness"). Several latent factor models have been estimated to investigate the structure of the word frequency distribution. The dependence of a word's frequency volatility on its average frequency can be explained by the asymmetry in the distribution of latent factors.
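
As a rough illustration of how such an exponent might be estimated, the sketch below fits a straight line to log standard deviation against log mean frequency. This is a generic log-log least-squares fit, not necessarily the procedure used in the paper. The function name `fit_volatility_exponent`, the input layout (one row per text, one column per word), and the synthetic data are all assumptions made for this example.

```python
import numpy as np

def fit_volatility_exponent(freqs):
    """Fit std = prefactor * mean**exponent across words (hypothetical helper).

    freqs: array of shape (n_texts, n_words) holding each word's
    relative frequency in each text.
    """
    freqs = np.asarray(freqs, dtype=float)
    mean = freqs.mean(axis=0)          # average frequency per word
    std = freqs.std(axis=0, ddof=1)    # sample std per word across texts
    keep = (mean > 0) & (std > 0)      # logs need strictly positive values
    # Ordinary least squares on the log-log scale: log(std) = a*log(mean) + b
    exponent, intercept = np.polyfit(np.log(mean[keep]), np.log(std[keep]), 1)
    return exponent, np.exp(intercept)

# Synthetic data (not from the paper): two "texts" per word, constructed so
# that the spread around the mean is exactly proportional to mean**0.62.
means = np.logspace(-4, -2, 50)
spread = 0.01 * means ** 0.62
toy_freqs = np.vstack([means - spread, means + spread])

exponent, prefactor = fit_volatility_exponent(toy_freqs)
print(f"estimated exponent: {exponent:.2f}")
```

On real corpora the fit would be noisier, and words that occur in very few texts may need separate handling; the filter on positive mean and standard deviation is only a minimal safeguard.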
