Correlation Dimension of Natural Language in a Statistical Manifold
Xin Du, Kumiko Tanaka-Ishii

TL;DR
This paper measures the correlation dimension of natural language by applying the Grassberger-Procaccia algorithm on a statistical manifold. It finds a multifractal structure with a universal dimension around 6.5, shows that long memory underlies the self-similarity, and applies the same method to music data.
Contribution
It reformulates the Grassberger-Procaccia algorithm within a statistical manifold framework, enabling dimension analysis of probabilistic models of sequences.
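For orientation, the quantities involved can be written in their standard textbook form. The notation here is an assumption and not taken from the paper: $x_1, \dots, x_N$ are points of the sequence, $d$ is the chosen distance, and $p$, $q$ are categorical distributions over the same outcomes. The paper's exact formulation may differ in detail.

```latex
C(\varepsilon) = \lim_{N \to \infty} \frac{2}{N(N-1)} \sum_{1 \le i < j \le N} \mathbf{1}\!\left[ d(x_i, x_j) < \varepsilon \right],
\qquad
\nu = \lim_{\varepsilon \to 0} \frac{\log C(\varepsilon)}{\log \varepsilon},
\qquad
d_{\mathrm{FR}}(p, q) = 2 \arccos\!\left( \sum_{k} \sqrt{p_k \, q_k} \right)
```

Here $C(\varepsilon)$ is the correlation integral, $\nu$ the correlation dimension, and $d_{\mathrm{FR}}$ the Fisher-Rao distance between categorical distributions, which takes the place of the Euclidean distance.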
Findings
Language has a universal correlation dimension around 6.5.
Long memory drives the self-similarity in language.
The method applies to any probabilistic model of real-world discrete sequences, as demonstrated on music data.
Abstract
The correlation dimension of natural language is measured by applying the Grassberger-Procaccia algorithm to high-dimensional sequences produced by a large-scale language model. This method, previously studied only in a Euclidean space, is reformulated in a statistical manifold via the Fisher-Rao distance. Language exhibits a multifractal, with global self-similarity and a universal dimension around 6.5, which is smaller than those of simple discrete random sequences and larger than that of a Barabási-Albert process. Long memory is the key to producing self-similarity. Our method is applicable to any probabilistic model of real-world discrete sequences, and we show an application to music data.
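The procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each sequence position is represented by a categorical distribution (a row of `P`), and it uses random Dirichlet samples as a stand-in for language-model next-token distributions. The function names and the range of `eps_values` are hypothetical choices made for illustration.

```python
import numpy as np

def fisher_rao_distances(P):
    """Pairwise Fisher-Rao distances between the rows of P.

    Each row is assumed to be a categorical distribution. For such
    distributions the distance is 2 * arccos(sum_k sqrt(p_k * q_k)).
    """
    S = np.sqrt(P)
    bc = np.clip(S @ S.T, 0.0, 1.0)  # Bhattacharyya coefficients, clipped for arccos
    return 2.0 * np.arccos(bc)

def correlation_integral(D, eps):
    """Fraction of distinct pairs (i < j) whose distance is below eps."""
    iu = np.triu_indices(D.shape[0], k=1)
    return np.mean(D[iu] < eps)

def correlation_dimension(P, eps_values):
    """Estimate the slope of log C(eps) against log eps by least squares."""
    eps_values = np.asarray(eps_values, dtype=float)
    D = fisher_rao_distances(P)
    C = np.array([correlation_integral(D, e) for e in eps_values])
    mask = C > 0  # log is undefined where no pair falls within eps
    slope, _ = np.polyfit(np.log(eps_values[mask]), np.log(C[mask]), 1)
    return slope

# Toy data (an assumption): 500 random distributions over 5 outcomes.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(5), size=500)
eps_values = np.geomspace(0.2, 1.0, 10)
nu = correlation_dimension(P, eps_values)
print(f"estimated correlation dimension: {nu:.2f}")
```

In practice the slope should be read off only over the range of `eps` where the log-log plot is close to linear. The fixed range here is chosen purely to keep the example short.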
