Entropy and Long range correlations in literary English
Werner Ebeling, Thorsten Poeschel

TL;DR
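This study analyzes long-range correlations in two literary texts, Moby Dick and Grimm's tales, using entropy measures. It finds power-law decay of the mutual information between letters, inverse-square-root decay of the entropy per letter, and stretched exponential growth of the number of distinct subwords, with correlations extending over a few hundred letters.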
Contribution
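The paper presents a detailed entropy-based analysis of long-range correlations in literary texts, identifying scaling laws for the mutual information, the entropy per letter, and the number of distinct subwords, and finding correlations that extend over a few hundred letters once finite-length effects are filtered out.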
Findings
Mutual information decays as a power law with distance.
Entropy per letter decreases with the inverse square root of subword length.
Number of distinct subwords grows as a stretched exponential with subword length and as a power law with text length.
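The three findings above can be written schematically as follows. This is only a sketch of the functional forms: the symbols I(k), h_n, N(n), the constants c, c', and the exponents alpha and beta are notation assumed here for illustration, and no numerical values are taken from the paper.

```latex
% Schematic scaling forms; all symbol names and constants are assumptions.
\begin{align}
  I(k) &\propto k^{-\alpha}, \quad \alpha > 0
    && \text{(mutual information of letters at distance } k\text{)} \\
  h_n &\approx h_\infty + \frac{c}{\sqrt{n}}
    && \text{(entropy per letter for subword length } n\text{)} \\
  N(n) &\propto \exp\!\left(c'\, n^{\beta}\right), \quad 0 < \beta < 1
    && \text{(number of distinct subwords of length } n\text{)}
\end{align}
```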
Abstract
We investigated long-range correlations in two literary texts, Moby Dick by H. Melville and Grimm's tales. The analysis is based on the calculation of entropy-like quantities, such as the mutual information for pairs of letters and the entropy (the mean uncertainty) per letter. We further estimate the number of different subwords of a given length n. After filtering out the contributions due to the finite length of the texts, we find correlations extending over a few hundred letters. We found scaling laws for the mutual information (power-law decay), for the entropy per letter (decay with the inverse square root of n), and for the word numbers (stretched exponential growth with n and power-law growth with the text length).
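The quantities named in the abstract can be estimated from raw letter counts. The following is a minimal sketch, not the authors' procedure: the function names are hypothetical, and these are naive frequency-based estimates that do not apply the finite-length corrections the abstract mentions, so they become unreliable for large n or k.

```python
from collections import Counter
from math import log2


def mutual_information(text: str, k: int) -> float:
    """Naive estimate (in bits) of the mutual information between
    letters separated by k positions, with k >= 1."""
    pairs = Counter(zip(text, text[k:]))      # joint counts of (x_i, x_{i+k})
    total = sum(pairs.values())               # equals len(text) - k
    left = Counter(text[:len(text) - k])      # marginal of the first letter
    right = Counter(text[k:])                 # marginal of the second letter
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / total
        mi += p_ab * log2(p_ab / ((left[a] / total) * (right[b] / total)))
    return mi


def block_entropy(text: str, n: int) -> float:
    """Naive estimate (in bits) of the entropy H_n of subwords of length n."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())


def distinct_subwords(text: str, n: int) -> int:
    """Number of different subwords of length n occurring in the text."""
    return len({text[i:i + n] for i in range(len(text) - n + 1)})
```

One common way to approximate an entropy per letter from these estimates is the difference `block_entropy(text, n + 1) - block_entropy(text, n)`; whether this matches the estimator used in the paper is an assumption.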
