Correlation Dimension of Natural Language in a Statistical Manifold

Xin Du; Kumiko Tanaka-Ishii

arXiv:2405.06321·cs.CL·May 16, 2024


TL;DR

This paper measures the correlation dimension of natural language by applying the Grassberger-Procaccia algorithm on a statistical manifold. It reports a multifractal structure with a universal dimension of around 6.5, identifies long memory as the key to self-similarity, and shows an application to music data.

Contribution

It reformulates the Grassberger-Procaccia algorithm within a statistical manifold framework, enabling dimension analysis of probabilistic models of sequences.
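The idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each sequence position is represented by a categorical next-token distribution (a row of `P`), uses the standard closed form of the Fisher-Rao distance between categorical distributions, and estimates the dimension as the slope of a log-log fit. The function names and the Dirichlet toy data are hypothetical choices made for this example.

```python
import numpy as np

def fisher_rao_distance_matrix(P):
    """Pairwise Fisher-Rao distances between the rows of P.

    Each row is assumed to be a categorical distribution. For such
    distributions the distance is 2 * arccos(sum_k sqrt(p_k * q_k)).
    """
    root = np.sqrt(P)
    # Bhattacharyya coefficients; clip guards against rounding above 1.
    bc = np.clip(root @ root.T, 0.0, 1.0)
    return 2.0 * np.arccos(bc)

def correlation_integral(D, eps):
    """Fraction of distinct pairs (i < j) whose distance is below eps."""
    iu = np.triu_indices(D.shape[0], k=1)
    return np.mean(D[iu] < eps)

def correlation_dimension(D, eps_values):
    """Slope of log C(eps) against log eps over the given radii."""
    eps_values = np.asarray(eps_values, dtype=float)
    c = np.array([correlation_integral(D, e) for e in eps_values])
    mask = c > 0  # log is undefined where no pair falls within eps
    slope, _ = np.polyfit(np.log(eps_values[mask]), np.log(c[mask]), 1)
    return slope

if __name__ == "__main__":
    # Toy stand-in for model outputs: random points on a 5-simplex.
    rng = np.random.default_rng(0)
    P = rng.dirichlet(np.ones(6), size=500)
    D = fisher_rao_distance_matrix(P)
    radii = np.logspace(-0.5, 0.0, 10)
    print(correlation_dimension(D, radii))
```

In practice the range of radii matters: the slope should be read off a region where the log-log curve is close to linear, which this sketch does not check.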

Findings

01

Language has a universal correlation dimension around 6.5.

02

Long memory drives the self-similarity in language.

03

The method applies to any probabilistic model of real-world discrete sequences; an application to music data is shown.

Abstract

The correlation dimension of natural language is measured by applying the Grassberger-Procaccia algorithm to high-dimensional sequences produced by a large-scale language model. This method, previously studied only in a Euclidean space, is reformulated in a statistical manifold via the Fisher-Rao distance. Language exhibits a multifractal structure, with global self-similarity and a universal dimension around 6.5, which is smaller than those of simple discrete random sequences and larger than that of a Barabási-Albert process. Long memory is the key to producing self-similarity. Our method is applicable to any probabilistic model of real-world discrete sequences, and we show an application to music data.
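For reference, the quantities named in the abstract can be written out in standard textbook form. The notation below is chosen for this summary and is not necessarily the paper's: $x_1, \dots, x_N$ are the points of the sequence, $d$ is the distance between them, and $p$, $q$ are categorical distributions over a vocabulary indexed by $k$.

```latex
% Correlation integral: fraction of point pairs closer than \varepsilon
C(\varepsilon) = \lim_{N \to \infty} \frac{1}{N^2}
  \left|\{(i, j) : i \neq j,\ d(x_i, x_j) < \varepsilon\}\right|

% Correlation dimension: scaling exponent of C for small \varepsilon
\nu = \lim_{\varepsilon \to 0} \frac{\log C(\varepsilon)}{\log \varepsilon}

% Fisher-Rao distance between categorical distributions p and q
d_{\mathrm{FR}}(p, q) = 2 \arccos\!\left(\sum_{k} \sqrt{p_k\, q_k}\right)
```

Replacing the Euclidean distance with $d_{\mathrm{FR}}$ in the correlation integral is what moves the Grassberger-Procaccia procedure onto the statistical manifold.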
