Markov Chain Monte Carlo for generating ranked textual data

Roy Cerqueti; Valerio Ficcadenti; Gurjeet Dhesi; Marcel Ausloos

arXiv:2210.06963·stat.ME·October 14, 2022·Inf. Sci.

TL;DR

This paper introduces a Markov Chain Monte Carlo approach, specifically the Metropolis-Hastings algorithm, to analyze the rank-size distribution of words in text, demonstrating its effectiveness on US Presidential speeches and supporting its broader applicability.

Contribution

It presents a novel application of MCMC methods to rank-size law analysis in text data, establishing a Markov chain model for hapax legomena and validating its consistency through extensive statistical tests.

Findings

01. Hapax legomena follow a Markov chain of order one.

02. The method confirms the stochastic structure of rank-size distributions.

03. Hapaxes are characterized as rare, memory-less events.
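
For readers unfamiliar with the term, a hapax legomenon is a word that occurs exactly once in a text. The sketch below shows one way to extract such words in Python. The function name `hapax_legomena` and the simple lowercase, letters-and-apostrophes tokenisation are assumptions made for illustration; the paper's own preprocessing is not described on this page and may differ.

```python
import re
from collections import Counter


def hapax_legomena(text):
    """Return the sorted list of words that occur exactly once in `text`.

    Tokenisation here is an illustrative assumption: the text is lowercased
    and split into runs of ASCII letters and apostrophes.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return sorted(word for word, n in counts.items() if n == 1)


if __name__ == "__main__":
    sample = "The cat saw the dog; the dog ran."
    print(hapax_legomena(sample))
```

Counting the hapaxes of each speech in a corpus and sorting those counts in descending order would give the kind of ranked data that a rank-size law is fitted to.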

Abstract

This paper addresses a central theme in applied statistics and information science: the assessment of the stochastic structure of rank-size laws in text analysis. We rank the words in a corpus by frequency, in descending order. The starting point is that ranked data generated in linguistic contexts can be viewed as realisations of a discrete-state Markov chain whose stationary distribution follows a discretisation of the best-fitted rank-size law. The methodological toolkit is Markov Chain Monte Carlo, specifically the Metropolis-Hastings algorithm. The theoretical framework is applied to the rank-size analysis of the hapax legomena occurring in the speeches of the US Presidents. We offer a large number of statistical tests supporting the consistency of our methodological proposal. To pursue our…
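
The core idea in the abstract, a discrete-state Markov chain whose stationary distribution is a discretised rank-size law, can be illustrated with a minimal Metropolis-Hastings sketch. Everything specific here is an assumption made for illustration and not taken from the paper: the Zipf-type target weights r^(-alpha), the symmetric one-step random-walk proposal, and the names `zipf_weights` and `metropolis_hastings_ranks`. The paper's fitted law and proposal mechanism may differ.

```python
import random


def zipf_weights(n_ranks, alpha):
    """Unnormalised target weights w(r) = r**(-alpha) for ranks 1..n_ranks.

    A Zipf-type law is assumed purely for illustration.
    """
    return [r ** (-alpha) for r in range(1, n_ranks + 1)]


def metropolis_hastings_ranks(weights, n_steps, seed=0):
    """Simulate a chain on ranks 1..len(weights) targeting `weights`.

    Proposal (an assumption): move to rank r-1 or r+1 with probability 1/2
    each. Proposals outside the rank range are rejected, which keeps the
    proposal symmetric, so the acceptance ratio reduces to w(new) / w(old).
    """
    rng = random.Random(seed)
    n_ranks = len(weights)
    state = 1  # start at the top rank
    chain = []
    for _ in range(n_steps):
        proposal = state + rng.choice((-1, 1))
        if 1 <= proposal <= n_ranks:
            accept_prob = min(1.0, weights[proposal - 1] / weights[state - 1])
            if rng.random() < accept_prob:
                state = proposal
        chain.append(state)
    return chain


if __name__ == "__main__":
    weights = zipf_weights(n_ranks=10, alpha=1.0)
    chain = metropolis_hastings_ranks(weights, n_steps=200_000, seed=1)
    total = sum(weights)
    for rank in (1, 2, 3):
        empirical = chain.count(rank) / len(chain)
        print(rank, round(empirical, 3), round(weights[rank - 1] / total, 3))
```

Because each new state depends only on the current one, the simulated sequence is a first-order Markov chain by construction, and its long-run visit frequencies approach the normalised target weights.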
