Long-Range Dependence in Word Time Series: The Cosine Correlation of Embeddings
Paweł Wieczyński, Łukasz Dębowski

TL;DR
This paper explores how word usage in texts shows long-term memory patterns, comparing human writing to AI-generated text.
Contribution
The study introduces a novel method using cosine correlation of word embeddings to detect long-range dependence in texts.
Findings
Cosine correlation of word2vec embeddings shows stretched exponential decay in human texts up to 1000-word lags.
Large language models do not exhibit systematic long-range dependence in their generated texts.
The findings suggest a need for memory-rich architectures beyond Transformers and hidden Markov models.
Abstract
We analyze long-range dependence (LRD) for word time series, understood as a slower than exponential decay of the two-point Shannon mutual information. We achieve this by examining the decay of the cosine correlation, a proxy object defined in terms of the cosine similarity between word2vec embeddings of two words, computed by an analogy to the Pearson correlation. By the Pinsker inequality, the squared cosine correlation between two random vectors lower bounds the mutual information between them. Using the Standardized Project Gutenberg Corpus, we find that the cosine correlation between word2vec embeddings exhibits a readily visible stretched exponential decay for lags roughly up to 1000 words, thus corroborating the presence of LRD. By contrast, for the Human vs. LLM Text Corpus entailing texts generated by large language models, there is no systematic signal of LRD. Our findings may…
1. Introduction
Consider a time series $(W_i)_{i \in \mathbb{Z}}$ such as a text in natural language, a sequence of real numbers, or a sequence of vectors. Let $I(W_i; W_{i+n})$ be the Shannon mutual information between two random variables separated by n positions. By short-range dependence (SRD), we understand an asymptotic exponential bound for the decay of this dependence measure,
$$I(W_i; W_{i+n}) = O(\exp(-\delta n)), \quad \delta > 0. \tag{1}$$
By long-range dependence (LRD), we understand any sort of decay of the dependence measure that does not fall under (1). In particular, under LRD, we may have a power-law decay of the dependence measure,
$$I(W_i; W_{i+n}) \propto n^{-\gamma}, \quad \gamma > 0, \tag{2}$$
which resembles a more standard definition of LRD for the autocorrelation function by Beran [1], or we may have a stretched exponential decay thereof,
$$I(W_i; W_{i+n}) \propto \exp(-\delta n^{\beta}), \quad \delta > 0, \quad 0 < \beta < 1. \tag{3}$$
The SRD is characteristic of mixing Markov and hidden Markov processes ([2], Theorem 1), which assume that the probability of the next token depends only on a finite number of preceding tokens or on a bounded memory. Hence, the observation of LRD for sufficiently large lags implies that the time series generation cannot be modeled by a mixing Markov process of a relatively small order or—via the data-processing inequality ([3], Chapter 2.8)—by a mixing hidden Markov process with a small number of hidden states.
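As a small numerical illustration of SRD, the following sketch computes the two-point mutual information for a stationary two-state Markov chain and shows that it shrinks roughly geometrically with the lag. The transition matrix and the function name `lagged_mutual_information` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def lagged_mutual_information(P, n):
    """I(X_0; X_n) in nats for a stationary Markov chain with transition matrix P."""
    # Stationary distribution: left eigenvector of P for the eigenvalue 1.
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    pi = pi / pi.sum()
    # Joint law of (X_0, X_n) and the product of its (identical) marginals.
    joint = pi[:, None] * np.linalg.matrix_power(P, n)
    product = pi[:, None] * pi[None, :]
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / product[mask])))

# Hypothetical two-state chain, chosen only for illustration.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
lags = [1, 2, 4, 8, 16]
mi = [lagged_mutual_information(P, n) for n in lags]
for n, value in zip(lags, mi):
    print(f"lag {n:2d}: I = {value:.3e} nats")
```

On a logarithmic scale the printed values fall on an approximately straight line in the lag, which is the exponential damping that LRD rules out.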
By contrast, it has often been expressed that texts in natural language exhibit LRD [2,4,5,6,7,8,9,10,11]. Several empirical studies analyzing textual data at different linguistic levels, such as characters [2,4], words [9], or punctuation [11], have indicated that correlations in natural language persist over long distances. This persistent correlation suggests that dependencies in human language extend far beyond adjacent words or short phrases, spanning across entire paragraphs or even longer discourse structures.
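The LRD should be put on par with other statistical effects signaling that natural language is not a finite-state hidden Markov process, a theoretical linguistic claim that dates back to [12,13,14]. Let us write blocks of words $W_j^k := (W_j, W_{j+1}, \ldots, W_k)$. A power-law growth of the block mutual information
$$I(W_1^n; W_{n+1}^{2n}) \propto n^{\beta}, \quad 0 < \beta < 1, \tag{4}$$
is known as Hilberg’s law or as the neural scaling law [15,16,17]. Another observation [18] is a power-law logarithmic growth of the maximal repetition length
$$L(W_1^n) \propto (\log n)^{\alpha}, \quad \alpha > 1, \tag{5}$$
where we denote the maximal repetition length
$$L(w_1^n) := \max\{k : w_{i+1}^{i+k} = w_{j+1}^{j+k} \text{ for some } 0 \le i < j \le n - k\}. \tag{6}$$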
The long-range dependence (2) or (3), Hilberg’s law (4), and the maximal repetition law (5) have been all reported for natural language, whereas it can be mathematically proved that none of them is satisfied by finite-state hidden Markov processes [19,20].
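To make the maximal repetition length concrete, here is a minimal sketch that takes it to be the length of the longest block of tokens occurring at least twice in a sequence, with overlapping occurrences allowed. This reading of the definition and the function name `maximal_repetition` are assumptions. The search is deliberately naive and can be quadratic in the worst case, so it is meant for short inputs only.

```python
def maximal_repetition(tokens):
    """Length of the longest block that occurs at least twice (overlaps allowed)."""
    tokens = list(tokens)
    k = 0  # longest repeated block length confirmed so far
    while True:
        seen = set()
        repeated = False
        # Scan all blocks of length k + 1 and look for a duplicate.
        for i in range(len(tokens) - k):
            block = tuple(tokens[i:i + k + 1])
            if block in seen:
                repeated = True
                break
            seen.add(block)
        if not repeated:
            return k
        k += 1

print(maximal_repetition("abracadabra"))  # "abra" occurs twice
print(maximal_repetition("to be or not to be".split()))
```

The loop relies on the fact that a repeated block of length k + 1 always contains a repeated block of length k, so the first length without a repeat ends the search.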
The LRD, Hilberg’s law, and the maximal repetition law independently, and for different reasons, support the necessity of using complex memory architectures in contemporary large language models (LLMs). Neural networks designed for natural language processing must incorporate mechanisms capable of mimicking these laws. The older generation n-gram models struggle with this requirement for reasons that can be analyzed mathematically. By contrast, it has been unclear whether Transformers [21], with their attention-based mechanisms, can leverage these extensive relationships. Understanding the nature of the LRD, Hilberg’s law, and the maximal repetition law in textual data may shed some light on neural architectures that can make progress on language modeling tasks.
Various smoothing techniques were proposed to discern LRD at the character or phoneme level [2,4,6,7]. Without advanced estimation, the power-law decay of the Shannon mutual information between two characters is visible only for lags up to about 10 characters, beyond which it dissolves into noise [4]. By contrast, Lin and Tegmark [2] considered sophisticated estimation techniques and reported the power-law decay of the Shannon mutual information between two characters for much larger lags.
Because of the arbitrariness of word forms relative to the semantic content of the text, we are not convinced that the results by Lin and Tegmark [2] are not an artifact of their estimation method. For this reason, following the idea of Mikhaylovskiy and Churilov [9], we have decided to seek the LRD on the level of words. We have supposed that pairs of words rather than pairs of characters better capture the long-range semantic coherence of the text. Accordingly, we have expected that the LRD effect extends over a larger distance on the level of words than on the level of characters. Indeed, in the present study, we report a lower bound on the Shannon mutual information between two words that is salient for lags up to 1000 words, which is two orders of magnitude larger than the range of about 10 characters of the unsmoothed effect for characters.
A modest goal of this paper is to systematically explore a simple measure of dependence to check whether texts in natural language and those generated by large language models exhibit the LRD. Rather than directly investigating the Shannon mutual information, which is difficult to estimate for large alphabets and strongly dependent sources, we elect a measure of dependence called the cosine correlation. This object is related to the cosine similarity of two vectors and somewhat resembles the Pearson correlation. Formally, the cosine correlation between two random vectors U and V equals
$$\operatorname{CC}(U; V) := \mathbf{E}\left[\frac{U \cdot V}{\|U\| \|V\|}\right] - \mathbf{E}\left[\frac{U}{\|U\|}\right] \cdot \mathbf{E}\left[\frac{V}{\|V\|}\right], \tag{7}$$
where $\mathbf{E} X$ is the expectation of random variable X, $u \cdot v$ is the dot product, and $\|u\|$ is the norm. By contrast, the cosine similarity of two non-random vectors u and v is
$$\cos(u, v) := \frac{u \cdot v}{\|u\| \|v\|}. \tag{8}$$
In order to compute the cosine correlation or the cosine similarity for actual word time series, we need a certain vector representation of words. As a practical vector representation of words, one may consider word2vec embeddings used in large language models [22,23]. Word embeddings capture semantic relationships between words by mapping them into continuous spaces, allowing for a more meaningful measure of similarity between distant words in a text. In particular, Mikhaylovskiy and Churilov [9] observed an approximate power-law decay for the expected cosine similarity of word embeddings.
The paper by Mikhaylovskiy and Churilov [9] lacked, however, the following important theoretical insight. As a novel result of this paper, we demonstrate that the cosine correlation $\operatorname{CC}(U; V)$, rather than the expected cosine similarity $\mathbf{E} \cos(U, V)$, provides a lower bound for the Shannon mutual information $I(U; V)$. Applying the Pinsker inequality [24,25], we obtain the bound
$$I(U; V) \ge \frac{\operatorname{CC}(U; V)^2}{2}. \tag{9}$$
This approach provides an efficient alternative to direct statistical estimation of mutual information, which is often impractical due to the sparse nature of natural language data. In particular, a slower than exponential decay of the cosine correlation implies LRD. Thus, a time series with a power-law or stretched exponential decay of the cosine correlation is not a Markov process or a hidden Markov process.
Indeed, on the experimental side, we observe a stretched exponential decay of the cosine correlation, which is clearly visible roughly for lags up to 1000 words—but only for natural texts. By contrast, artificial texts do not exhibit this trend in a systematic way. Our source of natural texts is the Standardized Project Gutenberg Corpus [26], a diverse collection of literary texts that offers a representative sample of human language usage. Our source of artificial texts is the Human vs. LLM Text Corpus [27]. To investigate the effect of semantic correlations, we also consider the cosine correlation between moving sums of neighboring embeddings, a technique that we call pooling. Curiously, pooling does not make the stretched exponential decay substantially slower. The lack of a prominent LRD signal was already noticed for the previous generation of language models by Takahashi and Tanaka-Ishii [6,7].
Our observation of the slow decay of the cosine correlation in general confirms the prior results of Mikhaylovskiy and Churilov [9] and supports the hypothesis of LRD. We notice that Mikhaylovskiy and Churilov [9] did not try to fit the stretched exponential decay to their data and their power-law model was not visually very good. Both theoretical and experimental findings of this paper contribute to the growing body of statistical evidence proving that natural language is not a finite-state hidden Markov process.
What is more novel is that our findings may support the view that natural language cannot be generated by Transformer-based large language models either, given the absence of a systematic decay trend of the cosine correlation for the Human vs. LLM Text Corpus. As mentioned, the LRD, Hilberg’s law, and the maximal repetition law independently substantiate the necessity of sophisticated memory architectures in modern computational linguistic applications. These results open avenues for further research into the theoretical underpinnings of language structure, potentially informing the development of more effective models for language understanding and generation.
The organization of the article is as follows. Section 2 presents the theoretical results. Section 3 discusses the experiment. In particular, Section 3.1 presents our data. Section 3.2 describes the experimental methods. Section 3.3 presents the results. Section 3.4 offers the discussion. Section 4 contains the conclusion.
2. Theory
Like Mikhaylovskiy and Churilov [9] but unlike Li [4], Lin and Tegmark [2], and Takahashi and Tanaka-Ishii [6,7], we will seek LRD on the level of words rather than on the level of characters or phonemes. The Shannon mutual information between words is difficult to estimate for large alphabets and strongly dependent sources. Thus, we consider its lower bound defined via the cosine correlation of word2vec embeddings [22,23].
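Let $\mathbf{E} X$ denote the expectation of a real random variable X. Let $\log x$ be the natural logarithm of x and let $H(X) := \mathbf{E}[-\log p(X)]$ be the Shannon entropy of a discrete random variable X, where $p(X)$ is the probability density of X with respect to a reference measure ([3], Chapters 2.1 and 8.1). The Shannon mutual information between variables X and Y ([3], Chapters 2.4 and 8.5) equals
$$I(X; Y) := H(X) + H(Y) - H(X, Y). \tag{10}$$
By contrast, the Pearson correlation between real random variables X and Y is defined as
$$\operatorname{Corr}(X, Y) := \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X) \operatorname{Var}(Y)}}, \tag{11}$$
where we denote the covariance $\operatorname{Cov}(X, Y) := \mathbf{E}[XY] - \mathbf{E} X \, \mathbf{E} Y$ and the variance $\operatorname{Var}(X) := \operatorname{Cov}(X, X)$. By the Cauchy–Schwarz inequality, we have $|\operatorname{Corr}(X, Y)| \le 1$.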
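We will introduce an analog of the Pearson correlation coefficient for vectors, which we call the cosine correlation. First, let us recall three standard concepts. For vectors $u = (u_1, \ldots, u_d)$ and $v = (v_1, \ldots, v_d)$, we consider the dot product
$$u \cdot v := \sum_{i=1}^{d} u_i v_i, \tag{12}$$
the norm $\|u\| := \sqrt{u \cdot u}$, and the cosine similarity
$$\cos(u, v) := \frac{u \cdot v}{\|u\| \|v\|}. \tag{13}$$
By the Cauchy–Schwarz inequality, we have $|\cos(u, v)| \le 1$.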
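Now, we consider something less standard. For vector random variables U and V, we define the cosine correlation
$$\operatorname{CC}(U; V) := \mathbf{E}\left[\frac{U \cdot V}{\|U\| \|V\|}\right] - \mathbf{E}\left[\frac{U}{\|U\|}\right] \cdot \mathbf{E}\left[\frac{V}{\|V\|}\right]. \tag{14}$$
If U and V are discrete and we denote the difference of measures
$$\Delta(u, v) := P(U = u, V = v) - P(U = u) P(V = v), \tag{15}$$
then we may write
$$\operatorname{CC}(U; V) = \sum_{u, v} \Delta(u, v) \cos(u, v). \tag{16}$$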
We observe that if random variables U and V are unidimensional, then with probability 1 and . Similarly, if is constant with probability 1 or if U and V are independent.
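The definition can be tried out on paired samples. The sketch below estimates the cosine correlation as the mean cosine similarity of paired rows minus the dot product of the mean unit vectors, which is our reading of the definition. The function name `cosine_correlation` and the synthetic Gaussian data are illustrative assumptions.

```python
import numpy as np

def cosine_correlation(U, V):
    """Sample cosine correlation for paired rows of U and V (samples x dimension)."""
    U = np.asarray(U, dtype=float)
    V = np.asarray(V, dtype=float)
    # Project every row onto the unit sphere.
    unit_u = U / np.linalg.norm(U, axis=1, keepdims=True)
    unit_v = V / np.linalg.norm(V, axis=1, keepdims=True)
    # Mean cosine similarity minus the dot product of the mean unit vectors.
    mean_cosine = np.mean(np.sum(unit_u * unit_v, axis=1))
    return float(mean_cosine - unit_u.mean(axis=0) @ unit_v.mean(axis=0))

rng = np.random.default_rng(0)
U = rng.normal(size=(5000, 3))
V_dependent = U + 0.5 * rng.normal(size=(5000, 3))   # noisy copy of U
V_independent = rng.normal(size=(5000, 3))           # unrelated to U

cc_dependent = cosine_correlation(U, V_dependent)
cc_independent = cosine_correlation(U, V_independent)
cc_self = cosine_correlation(U, U)
print(cc_dependent, cc_independent, cc_self)
```

For the independent pair the estimate should hover near zero, while the estimate of a vector against itself stays within the unit interval.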
To build some more intuitions, let us notice the following three facts. First of all, the cosine correlation between two copies of a random vector lies in the unit interval.
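Theorem 1. We have
$$0 \le \operatorname{CC}(U; U) \le 1. \tag{17}$$
Proof. Let us write $\tilde{U} := U / \|U\|$. We have
$$\operatorname{CC}(U; U) = \mathbf{E} \|\tilde{U}\|^2 - \|\mathbf{E} \tilde{U}\|^2 = 1 - \|\mathbf{E} \tilde{U}\|^2, \quad 0 \le \|\mathbf{E} \tilde{U}\|^2 \le \mathbf{E} \|\tilde{U}\|^2 = 1. \tag{18}$$
Hence, the claim follows. □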
Second, the cosine correlation satisfies a version of the Cauchy–Schwarz inequality.
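Theorem 2. We have
$$|\operatorname{CC}(U; V)| \le \sqrt{\operatorname{CC}(U; U) \operatorname{CC}(V; V)} \le 1. \tag{19}$$
Proof. Let us write $U / \|U\| = (X_1, \ldots, X_d)$ and $V / \|V\| = (Y_1, \ldots, Y_d)$. By the Cauchy–Schwarz inequalities for random scalars X and Y and for real numbers $a_i$ and $b_i$, we obtain
$$|\operatorname{CC}(U; V)| = \left| \sum_{i=1}^{d} \operatorname{Cov}(X_i, Y_i) \right| \le \sum_{i=1}^{d} \sqrt{\operatorname{Var}(X_i) \operatorname{Var}(Y_i)} \le \sqrt{\sum_{i=1}^{d} \operatorname{Var}(X_i)} \sqrt{\sum_{i=1}^{d} \operatorname{Var}(Y_i)} = \sqrt{\operatorname{CC}(U; U) \operatorname{CC}(V; V)}. \tag{20}$$
Hence, the claim follows by (17). □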
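Third, we will show that cosine correlation $\operatorname{CC}(U; V)$ provides a lower bound for mutual information $I(U; V)$.
Theorem 3. We have
$$I(U; V) \ge \frac{\operatorname{CC}(U; V)^2}{2}. \tag{21}$$
Proof. Let us recall the Pinsker inequality
$$D(p \| q) := \sum_x p(x) \log \frac{p(x)}{q(x)} \ge \frac{1}{2} \left( \sum_x |p(x) - q(x)| \right)^2 \tag{22}$$
for two discrete probability distributions p and q [24,25]. By the Pinsker inequality (22), the Cauchy–Schwarz inequality $|\cos(u, v)| \le 1$, and identity (16) for the difference of measures $\Delta(u, v)$, we obtain
$$I(U; V) \ge \frac{1}{2} \left( \sum_{u, v} |\Delta(u, v)| \right)^2 \ge \frac{1}{2} \left( \sum_{u, v} \Delta(u, v) \cos(u, v) \right)^2 = \frac{\operatorname{CC}(U; V)^2}{2}. \tag{23}$$
Hence, the claim follows. □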
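The chain of inequalities behind the bound can be checked numerically on a small discrete example. The sketch below draws a random joint distribution over a few vector values and compares the mutual information (in nats) with half the squared total variation sum and half the squared cosine correlation. The helper names and the random inputs are illustrative assumptions, and the cosine correlation is computed as the sum of the difference of measures weighted by cosine similarities.

```python
import numpy as np

def mutual_information(joint):
    """Shannon mutual information in nats for a joint probability table."""
    product = joint.sum(axis=1, keepdims=True) @ joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / product[mask])))

def cosine_correlation_discrete(joint, u_values, v_values):
    """Return (CC, sum of |Delta|) for discrete vector-valued U and V."""
    unit_u = u_values / np.linalg.norm(u_values, axis=1, keepdims=True)
    unit_v = v_values / np.linalg.norm(v_values, axis=1, keepdims=True)
    cosines = unit_u @ unit_v.T          # cos(u, v) for every pair of values
    product = joint.sum(axis=1, keepdims=True) @ joint.sum(axis=0, keepdims=True)
    delta = joint - product              # difference of measures
    return float(np.sum(delta * cosines)), float(np.sum(np.abs(delta)))

rng = np.random.default_rng(1)
u_values = rng.normal(size=(4, 3))       # four possible values of U
v_values = rng.normal(size=(5, 3))       # five possible values of V
joint = rng.random((4, 5))
joint /= joint.sum()                     # random joint distribution

mi = mutual_information(joint)
cc, l1 = cosine_correlation_discrete(joint, u_values, v_values)
print(f"I = {mi:.4f}, (sum|Delta|)^2 / 2 = {l1**2 / 2:.4f}, CC^2 / 2 = {cc**2 / 2:.4f}")
```

The three printed quantities should appear in non-increasing order, mirroring the two steps of the proof.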
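We note in passing that the Pinsker inequality can be modified as the Bretagnolle–Huber bound
$$D(p \| q) \ge -\log \left( 1 - \frac{1}{4} \left( \sum_x |p(x) - q(x)| \right)^2 \right) \tag{24}$$
for the Kullback–Leibler divergence $D(p \| q)$ of probability distributions p and q [28,29]. Respectively, we obtain
$$I(U; V) \ge -\log \left( 1 - \frac{\operatorname{CC}(U; V)^2}{4} \right). \tag{25}$$
This bound is weaker than (21) since $-\log(1 - x/4) \le x/2$ for $0 \le x \le 1$.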
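A quick numerical comparison of the two lower bounds over the admissible range of the cosine correlation may help here. The sketch assumes the Pinsker-type bound in the form CC²/2 and the Bretagnolle–Huber-type bound in the form −log(1 − CC²/4), both in nats; these forms are our reading of the text.

```python
import numpy as np

cc = np.linspace(0.0, 1.0, 11)             # admissible values of |CC|
pinsker_bound = cc ** 2 / 2                # Pinsker-type lower bound on I(U; V)
bh_bound = -np.log1p(-cc ** 2 / 4)         # Bretagnolle-Huber-type lower bound

for value, p, b in zip(cc, pinsker_bound, bh_bound):
    print(f"|CC| = {value:.1f}: Pinsker {p:.4f}, Bretagnolle-Huber {b:.4f}")
```

Across the whole range the second column stays below the first, so the Pinsker-type bound is the more informative of the two for this purpose.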
Let $(W_i)_{i \in \mathbb{Z}}$ be the text in natural language treated as a word time series. Let $\phi(w)$ be an arbitrary vector representation of word w, such as word2vec embeddings [22,23], and let $F_i := \phi(W_i)$. In particular, since embeddings $F_i$ are functions of words $W_i$, by the data-processing inequality ([3], Chapter 2.8) and by the cosine correlation bound (21), we obtain
$$I(W_i; W_{i+n}) \ge I(F_i; F_{i+n}) \ge \frac{\operatorname{CC}(F_i; F_{i+n})^2}{2}. \tag{26}$$
Wrapping up, a slow decay of cosine correlation $\operatorname{CC}(F_i; F_{i+n})$ implies a slow decay of mutual information $I(W_i; W_{i+n})$. Since $I(W_i; W_{i+n})$ is damped exponentially for any mixing Markov or hidden Markov process by Theorem 1 of Lin and Tegmark [2], observing a power-law or a stretched exponential decay of cosine correlation $\operatorname{CC}(F_i; F_{i+n})$ is enough to demonstrate that process $(W_i)_{i \in \mathbb{Z}}$ is not a mixing Markov or hidden Markov process.
The framework that we have constructed in this section has precedents in the literature. We remark that Mikhaylovskiy and Churilov [9] investigated estimates of expectation $\mathbf{E} \cos(F_i, F_{i+n})$ rather than cosine correlation $\operatorname{CC}(F_i; F_{i+n})$. That approach required estimation and subtraction of the asymptotic constant term. Mikhaylovskiy and Churilov [9] observed an approximate power-law decay but they did not mention the cosine correlation bound (21) in their discussion explicitly.
3. Experiment
3.1. Data
Our data consisted of three elements: a dictionary of embedding vectors for a subset of human languages, a corpus of texts written by humans in these languages, and a corpus of texts in English created by artificial intelligence. The considered set of human languages included 17 languages. Originally, we planned to use 20 languages with the largest text counts in the considered corpora but three of them, Esperanto, Chinese, and Tagalog, had to be excluded because the embedding dictionary did not cover these languages.
In particular, the source of pretrained word embeddings was chosen as the NLPL repository [23]. To provide a uniform baseline across languages, for all considered languages, we used 100-dimensional embedding vectors trained on the CoNLL17 corpora with the same algorithm, namely the word2vec continuous skipgram algorithm. None of these embedding vector spaces includes lemmatization. The vocabulary sizes of the embedding spaces for the considered 17 languages are presented in Table 1.
As the source of texts written by humans, we chose the Standardized Project Gutenberg Corpus (SPGC) [26]. The corpus provides texts after some preprocessing and tokenization, as detailed in [26]. We filtered the SPGC to obtain a more manageable yet representative subset of texts. As we have mentioned, we restricted the corpus to 17 languages with the largest text counts simultaneously covered by the applied NLPL embedding dictionary. Moreover, we filtered out files of the size above 1000 KB and we sampled up to 100 texts (or fewer if not available) per language in order to achieve roughly balanced subsets across particular languages.
To provide a comparison with texts generated by artificial intelligence, we also considered the Human vs. LLM Text Corpus (HLLMTC) [27]. All texts in the HLLMTC are in English. To make this corpus more computationally tractable, we sampled 1000 human-written texts and 6000 LLM-generated texts, where we chose 1000 texts for each of the six selected large language models. To convert these texts into word time series, we used an off-the-shelf tokenizer [30].
Table 2 provides the summary statistics of the obtained subsets of the Standardized Project Gutenberg Corpus and the Human vs. LLM Text Corpus. In particular, we report the token counts and the coverage of the sampled texts, i.e., the fraction of word tokens of texts that appear in the respective NLPL embedding dictionary.
3.2. Methods
In this section, we briefly describe what we measured and in what way. We supposed that the LRD on the level of words is due to semantic coherence of the text over longer distances. In particular, mutual information between two words is large as long as the text around these words concerns a similar topic. We supposed that the embedding of this local topic can be roughly estimated as the sum of embeddings of all words in the neighborhood, called a pooled embedding. Let be the embedding of the i-th word in the text. The pooled embeddings are defined as
for the pooling order . In particular, pooled embeddings for equal word embeddings, .
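Pooling can be sketched with cumulative sums. The exact window used in the paper is not recoverable from this text, so the example below assumes that the pooled embedding of order k at position i is the sum of the k consecutive embeddings starting at i, with k = 1 returning the embeddings unchanged. The function name `pooled_embeddings` is hypothetical.

```python
import numpy as np

def pooled_embeddings(E, k):
    """Moving sums of k consecutive rows of E (shape N x d); assumed window [i, i + k)."""
    E = np.asarray(E, dtype=float)
    if k < 1 or k > len(E):
        raise ValueError("pooling order k must satisfy 1 <= k <= len(E)")
    # Prefix sums with a leading zero row, so that window sums are differences.
    prefix = np.vstack([np.zeros((1, E.shape[1])), np.cumsum(E, axis=0)])
    return prefix[k:] - prefix[:-k]

E = np.arange(12, dtype=float).reshape(6, 2)   # toy "embeddings" of six tokens
print(pooled_embeddings(E, 1))
print(pooled_embeddings(E, 3))
```

With prefix sums, each pooled sequence costs time proportional to the text length times the embedding dimension, independently of k.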
The object that we wanted to measure was the cosine correlation for pooled embeddings, namely
Function is substantially larger for since the summations for variables and range partly over overlapping embeddings . Thus, if one wants to estimate the functional form of the decay of , it makes sense to fit the respective function exclusively to data points where .
Let us proceed to the estimation of function . Let be the embedding of word w according to the considered word2vec dictionary. From each text, we removed all word tokens that did not have an embedding in the dictionary. In this way, we obtained a collection of word time series , corresponding word embeddings , and pooled embeddings given by formula (27). We estimated the expectations as the averages over the times series. That is, we computed the estimator of defined as
where we used the auxiliary time series
We observe that . Therefore, the computational complexity of estimator for fixed n and k is of order , where N is the text length and d is the dimension of embeddings .
For each text , we computed estimators for lags , where
and pooling orders . We observed that the plot of the absolute value for considered texts usually dissolved into random noise around and there was a hump for , as expected. Hence, to estimate the functional form of the decay of , we restricted the fitting procedure to range .
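The estimation step can be illustrated as follows. This is a simplified sketch, not the estimator of the paper: it normalizes every embedding to unit length and, for each lag n, averages over all pairs of positions (i, i + n) within one sequence. The function name `lagged_cosine_correlation` and the synthetic autoregressive test sequence are assumptions made for illustration.

```python
import numpy as np

def lagged_cosine_correlation(E, lags):
    """Plug-in cosine correlation between rows i and i + n of E for each lag n."""
    E = np.asarray(E, dtype=float)
    unit = E / np.linalg.norm(E, axis=1, keepdims=True)
    estimates = {}
    for n in lags:
        head, tail = unit[:-n], unit[n:]          # aligned pairs (i, i + n)
        mean_cosine = np.mean(np.sum(head * tail, axis=1))
        estimates[n] = float(mean_cosine - head.mean(axis=0) @ tail.mean(axis=0))
    return estimates

rng = np.random.default_rng(0)
N, d = 5000, 3

# Synthetic sequence with short memory (vector AR(1)), for illustration only.
X = np.empty((N, d))
X[0] = rng.normal(size=d)
for t in range(1, N):
    X[t] = 0.9 * X[t - 1] + rng.normal(size=d)

noise = rng.normal(size=(N, d))                   # sequence without any memory

cc_ar = lagged_cosine_correlation(X, [1, 5, 20])
cc_noise = lagged_cosine_correlation(noise, [5])
print(cc_ar, cc_noise)
```

For real texts, the rows of `E` would be the word or pooled embeddings of the tokens that have an entry in the embedding dictionary.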
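The parameter estimation was performed using the curve_fit function from the SciPy library [31], which employs the trust region reflective algorithm when bounds are supplied. We selected this method due to its compatibility with bounded constraints. We estimated parameters of two functions: the power-law decay
$$f(n) = c n^{-\gamma} \tag{32}$$
and the stretched exponential decay
$$f(n) = b \exp(-\delta n^{\beta}) \tag{33}$$
with parameters c, $\gamma$, b, $\delta$, and $\beta$ implicitly depending on the pooling order k. As a goodness-of-fit metric, we calculated the sum of squared logarithmic residuals
$$\mathrm{SSLR} := \sum_{n} \left[ \log |\hat{C}(n)| - \log f(n) \right]^2, \tag{34}$$
where $\hat{C}(n)$ denotes the cosine correlation estimate at lag n and the sum runs over the fitted lags, divided by the number of degrees of freedom (ndf), equal to the number of fitted data points minus the number of parameters of f.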
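The fitting procedure can be sketched with SciPy as follows. The model forms, parameter names, starting values, bounds, lag range, and synthetic data are all assumptions made for illustration. Only the use of `curve_fit` with the trust region reflective method and the logarithmic residual metric follow the description above.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, c, gamma):
    return c * n ** (-gamma)

def stretched_exp(n, b, delta, beta):
    return b * np.exp(-delta * n ** beta)

def sslr_per_ndf(y, y_fit, n_params):
    """Sum of squared logarithmic residuals divided by the degrees of freedom."""
    residuals = np.log(y) - np.log(y_fit)
    return float(np.sum(residuals ** 2) / (len(y) - n_params))

# Synthetic "absolute cosine correlation" with multiplicative noise (illustration only).
rng = np.random.default_rng(0)
n = np.arange(1, 1001, dtype=float)
y = stretched_exp(n, 0.5, 0.3, 0.4) * np.exp(0.02 * rng.normal(size=n.size))

# Bounded fits; method="trf" is the trust region reflective algorithm.
p_power, _ = curve_fit(power_law, n, y, p0=[1.0, 0.5],
                       bounds=([0.0, 0.0], [np.inf, np.inf]), method="trf")
p_stretched, _ = curve_fit(stretched_exp, n, y, p0=[1.0, 0.5, 0.5],
                           bounds=([0.0, 0.0, 0.0], [np.inf, np.inf, 1.0]), method="trf")

sslr_power = sslr_per_ndf(y, power_law(n, *p_power), 2)
sslr_stretched = sslr_per_ndf(y, stretched_exp(n, *p_stretched), 3)
print("power law:", p_power, sslr_power)
print("stretched exponential:", p_stretched, sslr_stretched)
```

Since the synthetic data are generated from a stretched exponential, that model should obtain the smaller score here. On real estimates, `curve_fit` may raise `RuntimeError` when the optimizer does not converge, which corresponds to the failure cases mentioned in the results.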
We investigated the dependence of the results on the source, understood as the particular language for human-written texts or the particular language model for LLM-generated texts. To check whether there are significant differences in the distribution of a parameter across particular sources, we used the non-parametric Kruskal–Wallis test with the null hypothesis
$$H_0: P_1 = P_2 = \cdots = P_m, \tag{35}$$
where $P_j$ is the distribution of the given parameter for the j-th source and m is the number of sources. To further explore differences among different sources, we employed the post-hoc Dunn test with the Bonferroni correction for multiple comparisons.
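The omnibus test can be run with SciPy as sketched below. The three groups of fitted parameter values are synthetic and the source names are hypothetical. The post-hoc Dunn test is not shown, since this sketch sticks to functions from SciPy itself.

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)

# Synthetic fitted-parameter samples for three hypothetical sources.
groups = {
    "source_a": rng.normal(0.40, 0.05, size=50),
    "source_b": rng.normal(0.40, 0.05, size=50),
    "source_c": rng.normal(0.60, 0.05, size=50),   # shifted on purpose
}

# Null hypothesis: all sources share the same parameter distribution.
statistic, p_value = kruskal(*groups.values())
print(f"H = {statistic:.2f}, p = {p_value:.3g}")
```

A small p-value only says that at least one source differs; identifying which pairs differ requires the post-hoc comparison with a multiple-testing correction.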
3.3. Results
Visually, the decay of the absolute cosine correlation estimates for usually follows a stretched exponential form rather than the exact power-law decay for human-written text. By contrast, no systematic decay for can be detected for LLM-generated texts. This tendency can be seen in Figure 1, which is a diagnostic plot of the absolute cosine correlation estimates for two texts: Cecilia: A Story of Modern Rome in English from the SPGC corpus and Text no. 702 by GPT 3.5, which is the longest LLM-generated text in the sampled subset of the HLLMTC corpus.
In Table 3, Table 4, Table 5, Table 6 and Table 7, we report the means and the standard deviations of the fitted parameters of the power-law model (32) and of the stretched exponential model (33). The values are broken down by language for human-written texts and by language model for LLM-generated texts. When fitting the models, the optimization algorithm sometimes did not converge. The failure rates and the overall goodness of fit are reported in Table 8. Despite the visual appeal of the stretched exponential model, the mean SSLR given by Formula (34) is smaller for the power-law model. This does not necessarily mean that the power-law model is better, however, since the standard deviation of the SSLR is greater than the mean for the stretched exponential model.
3.4. Discussion
Like Mikhaylovskiy and Churilov [9] but unlike Li [4], Lin and Tegmark [2], and Takahashi and Tanaka-Ishii [6,7], we have sought the LRD on the level of words rather than on the level of characters or phonemes. We have hypothesized that word-level dependencies yield a more prominent effect due to semantic coherence of lexical units over longer distances as compared to phoneme-level correlations, which tend to decay faster, in view of the arbitrariness of word forms.
Indeed, analyzing the cosine similarity of word embeddings, like Mikhaylovskiy and Churilov [9], or their cosine correlation, in the present study, one observes a clearly visible LRD effect for natural, i.e., human-written texts. Mikhaylovskiy and Churilov [9] reported a rough power-law decay without considering an alternative model. By contrast, we have considered both a power-law model and a stretched exponential model and both natural texts and LLM-generated texts.
We report that the slow decay of the cosine correlation extends up to 1000 words for natural texts, whereas it is dominated by noise for LLM-generated texts—as it was already observed for the previous generation of language models [6,7]. These effects can be seen in the diagnostic Figure 1 and independently witnessed by Table 3, Table 4, Table 5, Table 6, Table 7 and Table 8, where fitting to the random noise results in highly unstable estimates and outliers pumping up the standard deviations beyond the means. Curiously, the decay of the cosine correlation does not change systematically as the pooling order k increases, despite our prior expectation that the cosine correlation would increase monotonically with k.
The distributions of the fitted parameters of the power-law model (32) and of the stretched exponential model (33) vary significantly across different human languages and different large language models, according to the Kruskal–Wallis tests. This means that the cosine correlation decays at different source-specific rates. At the moment, we are unable to state clearly what the cause of this variation may be.
For example, the Japanese language seems an outlier in many categories but this need not be directly caused by language typology. We notice that the available texts in Japanese are very short and their coverage in terms of embeddings is much lower than for other languages. Maybe our experimental methodology fails for very short texts in general. This might be an alternative explanation of the poor fitting results for LLM-generated texts which are very short, as well, as shown in Table 2 and Table 8.
4. Conclusions
In this paper, we have provided empirical support for the claim that texts in natural language exhibit long-range dependence (LRD), understood as a slower than exponential decay of the two-point mutual information. Similar claims have been reiterated in the literature [2,4,5,6,7,8,9,10,11] but we hope that we have provided more direct and convincing evidence.
First, as a theoretical result, we have shown that the squared cosine correlation lower bounds the Shannon mutual information between two vectors. Under this bound, a power-law or a stretched exponential decay of the cosine correlation implies the LRD. In particular, a vector time series that exhibits such a slow decay of the cosine correlation cannot be a mixing Markov or hidden Markov process by Theorem 1 of Lin and Tegmark [2].
Second, using the Standardized Project Gutenberg Corpus [26] and vector representations of words taken from the NLPL repository [23], we have shown experimentally that the estimates of the cosine correlation of word embeddings follow a stretched exponential decay. This decay extends for lags up to 1000 words without any smoothing, which is two orders of magnitude larger than the range of about 10 characters of the unsmoothed, presumably LRD, effect for characters [4].
Third, the stability of this decay suggests that the LRD is a fundamental property of natural language, rather than an artifact of specific preprocessing methods or statistical estimation techniques. The observation of the slow decay of the cosine correlation for natural texts not only supports the hypothesis of LRD but also reaffirms the prior results of Mikhaylovskiy and Churilov [9], who reported a rough power-law decay of the expected cosine similarity of word embeddings.
Fourth, like Takahashi and Tanaka-Ishii [6,7], we have observed the LRD only for natural data. We stress that, as we were able to observe, artificial data do not exhibit the LRD in a systematic fashion. Our source of artificial texts was the Human vs. LLM Text Corpus [27]. We admit that texts in this corpus may be too short to draw firm conclusions and further research on longer LLM-generated texts is necessary to confirm our early claim.
As we have mentioned in the introduction, non-Markovianity effects such as the LRD, Hilberg’s law [15,16,17], and the maximal repetition law [18] may have implications for understanding the limitations and capabilities of contemporary language models. The presence of such effects in natural texts in contrast to texts generated by language models highlights the indispensability of complex memory mechanisms, potentially showing that state-of-the-art architectures such as Transformers [21] are insufficient.
Future research might explore whether novel architectures could capture quantitative linguistic constraints such as the LRD more effectively [32]. Further studies may also explore alternative embeddings or dependence measures and their impact on the stability of the LRD measures such as the stretched exponential decay parameters. Investigating other linguistic corpora, text genres, and languages could also provide valuable insights into the universality of these findings.
References
- 1. Beran, J. Statistics for Long-Memory Processes; Chapman & Hall: New York, NY, USA, 1994.
- 2. Lin, H.W.; Tegmark, M. Critical Behavior in Physics and Probabilistic Formal Languages. Entropy 2017, 19, 299. doi:10.3390/e19070299
- 3. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley & Sons: New York, NY, USA, 2006.
- 4. Li, W. Mutual Information Functions versus Correlation Functions. J. Stat. Phys. 1990, 60, 823–837. doi:10.1007/BF01025996
- 5. Altmann, E.G.; Pierrehumbert, J.B.; Motter, A.E. Beyond Word Frequency: Bursts, Lulls, and Scaling in the Temporal Distributions of Words. PLoS ONE 2009, 4, e7678. doi:10.1371/journal.pone.0007678. PMID 19907645. PMC2770836.
- 6. Takahashi, S.; Tanaka-Ishii, K. Do neural nets learn statistical laws behind natural language? PLoS ONE 2017, 12, e0189326. doi:10.1371/journal.pone.0189326. PMID 29287076. PMC5747447.
- 7. Takahashi, S.; Tanaka-Ishii, K. Evaluating computational language models with scaling properties of natural language. Comput. Linguist. 2019, 45, 481–513. doi:10.1162/coli_a_00355
- 8. Tanaka-Ishii, K. Statistical Universals of Language: Mathematical Chance vs. Human Choice; Springer: New York, NY, USA, 2021.
