Universal parameterized family of distributions of runs
Hayato Takahashi

| 1 | 3 | 5 | 7 | 9 |
|---|---|---|---|---|
| 0.117859 | 0.0168652 | 0.0036909 | 0.0009005 | 0.0002248 |
Parts of the paper have been presented at MSJ2017, MSJ2023, and ICIAM 2023; see Takahashi (2023a, b).
Hayato Takahashi, Random Data Lab. Inc., Tokyo 1210062, Email: [email protected]
Abstract
We present explicit formulae for parameterized families of distributions of the number of nonoverlapping words and increasing nonoverlapping words in independent and identically distributed (i.i.d.) finite valued random variables, respectively. Then we provide an explicit formula for a parameterized family of distributions of the number of runs, which generalizes \(\mu\)-overlapping distributions for \(\mu\geq 0\) in i.i.d. binary valued random variables. We also demonstrate exact distributions of the number of runs whose sizes are exactly given numbers (Mood 1940). The number of arithmetic operations required to compute our formula for generalized distributions of runs is of linear order in the sample size for a fixed number of parameters and a fixed range.
Keywords: exact distribution, scan, run, pattern, inclusion-exclusion principles
Mathematics Subject Classification: 05A15, 62E15
1 Introduction
We study distributions of the number of words in finite valued i.i.d. random variables (distributions of words for short). The distributions of words play an important role in statistics, DNA analysis, and information theory; see Balakrishnan & Koutras (2002); Berthé & Rigo (2016); Feller (1970); Jacquet & Szpankowski (2015); Lothaire (2005); Mood (1940); Robin et al. (2005); Wald & Wolfowitz (1940); Waterman (1995), and Zehavi & Wolf (1988).
Generating functions of the distributions of words obtained by inductive relations of words on sample size are inevitably rational functions; see Bassino et al. (2010); Blom & Thorburn (1982); Chrysaphinou & Papastavridis (1988); Flajolet & Sedgewick (2009); Goulden & Jackson (1983); Guibas & Odlyzko (1981), and Régnier & Szpankowski (1998). Feller (1970), Jacquet & Szpankowski (2015), and Robin et al. (2005) obtain approximations and recurrence formulae for the distributions of words from rational generating functions. Uppuluri & Patil (1983) and Antzoulakos & Chadjiconstantindis (2001) obtain explicit formulae by expanding rational generating functions into power series. However, in general, expanding rational functions into power series is not immediate; cf. Feller (1970), Chapter 11, Section 4, p. 275.
A word that consists of the same letter is called a run. The number of runs depends on the counting manner. Let be the word that consists of zeros. For , let
(i) , the number of of size exactly in (Mood, 1940; Fu & Koutras, 1994),
(ii) , the number of of size greater than or equal to in (Fu & Koutras, 1994; Antzoulakos & Chadjiconstantindis, 2001),
(iii) , the number of nonoverlapping in (Godbole, 1990; Hirano, 1986; Muselli, 1996; Antzoulakos & Chadjiconstantindis, 2001; Fu & Koutras, 1994; Feller, 1970),
(iv) , the number of overlapping in (Ling, 1988; Antzoulakos & Chadjiconstantindis, 2001; Koutras & Alexandrou, 1997; Fu & Koutras, 1994; Godbole, 1992),
(v) , the size of the longest run of 0s in (Makri et al., 2007; Phillipou & Makri, 1986; Antzoulakos & Chadjiconstantindis, 2001; Fu & Koutras, 1994),
(vi) , the stopping time such that first appear in (Aki et al., 1984; Philippou et al., 1983; Uppuluri & Patil, 1983), and
(vii) , the enumeration of such that we allow -letters overlapping with the previous in the string (Aki & Hirano, 2000; Han & Aki, 2000; Makri & Psillakis, 2015).
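The counting conventions (i) to (vii) can be made concrete with a short sketch. The function names below are hypothetical (the paper's own symbols for these statistics were lost in extraction), and the code assumes the standard definitions from the cited runs literature: a binary string over the characters "0" and "1", runs of zeros, and a μ-overlapping count in which each counted window may share at most μ letters with the previously counted one.

```python
from itertools import groupby


def zero_run_lengths(x):
    """Lengths of the maximal runs of '0' in the binary string x."""
    return [len(list(group)) for symbol, group in groupby(x) if symbol == "0"]


def runs_exact(x, m):
    """(i) Number of maximal zero runs of size exactly m."""
    return sum(1 for r in zero_run_lengths(x) if r == m)


def runs_at_least(x, m):
    """(ii) Number of maximal zero runs of size at least m."""
    return sum(1 for r in zero_run_lengths(x) if r >= m)


def runs_mu_overlapping(x, m, mu):
    """(vii) Number of windows 0^m, each sharing at most mu letters with the
    previously counted window. mu = 0 gives the nonoverlapping count (iii);
    mu = m - 1 gives the overlapping count (iv)."""
    if not 0 <= mu < m:
        raise ValueError("mu must satisfy 0 <= mu < m")
    count, next_start = 0, 0
    for i in range(len(x) - m + 1):
        if i >= next_start and x[i:i + m] == "0" * m:
            count += 1
            next_start = i + (m - mu)  # earliest start allowed for the next window
    return count


def longest_run(x):
    """(v) Size of the longest run of zeros (0 if there is none)."""
    return max(zero_run_lengths(x), default=0)


def waiting_time(x, m):
    """(vi) First time (1-based) at which 0^m has appeared, or None."""
    i = x.find("0" * m)
    return None if i < 0 else i + m
```

For the string "0000100" and m = 2, the maximal zero runs have sizes 4 and 2, so the seven conventions already disagree on this small input, which is why a parameterized family covering all of them is useful.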
Fu & Koutras (1994) provide nonparametric exact distributions of runs by the Markov imbedding method. Although obtaining parametric models for distributions of words is desirable (Stefanov & Pakes, 1997), as far as the author is aware, no explicit formulae for parameterized families of distributions of nonoverlapping words, nonoverlapping increasing words, , and are known.
In this paper, we present explicit formulae for distributions of these statistics. To avoid the difficulty of enumerating overlapping words and expanding rational functions into power series, in Theorem 3.2, we study distributions of increasing nonoverlapping words and their finite dimensional generating functions. Combining Theorem 3.2 with a combinatorial lemma, in Theorem 3.6, we derive explicit formulae for parameterized distributions of runs, including those of the statistics (i)–(vii) above, in a unified manner for binary valued i.i.d. random variables. Generalization of our formulae in Theorem 3.6 to countable valued i.i.d. random variables is straightforward, see Remark 3.7.
The rest of the paper is organized as follows. In Section 2, Theorems 2.1 and 2.2, we show explicit formulae for parameterized families of joint probabilities of nonoverlapping words and their moments for finite valued i.i.d. random variables. In Section 3, we derive explicit formulae for distributions of runs. In Section 4, we study the distance between the distributions of runs. In Section 5, we present an algorithm for computing our formulae and analyze its complexity.
2 Joint distributions of nonoverlapping words
A finite string of a finite alphabet is called a word. Let be the length of a word . The word is the concatenation of two words and . The word is the -times concatenations of a word , e.g. . A word is called overlapping if there is a word such that appears at least 2 times in and ; otherwise is called nonoverlapping. A pair of words is called overlapping if there is a word such that and appear in and ; otherwise the pair is called nonoverlapping. A finite set of words is called nonoverlapping if every and pair are nonoverlapping; otherwise is called overlapping. For example, are nonoverlapping; and are overlapping.
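The definitions above can be checked mechanically. The sketch below is an illustration under stated assumptions, since the length conditions in the definitions were lost in extraction: it assumes a word x is overlapping when two occurrences of x fit in a word shorter than 2|x| (equivalently, x has a nonempty proper prefix equal to a suffix), and a pair (x, y) is overlapping when both fit in a word shorter than |x| + |y|. All function names are hypothetical.

```python
from itertools import combinations


def has_border(w):
    """True if some nonempty proper prefix of w equals a suffix of w,
    i.e. two occurrences of w can share letters."""
    return any(w[:k] == w[-k:] for k in range(1, len(w)))


def pair_overlaps(x, y):
    """True if occurrences of x and y can share letters: one contains the
    other, or a proper suffix of one is a proper prefix of the other."""
    if x in y or y in x:
        return True
    return any(x[-k:] == y[:k] or y[-k:] == x[:k]
               for k in range(1, min(len(x), len(y))))


def is_nonoverlapping_set(words):
    """True if every word and every pair of distinct words is nonoverlapping."""
    words = list(words)
    if any(has_border(w) for w in words):
        return False
    return not any(pair_overlaps(x, y) for x, y in combinations(words, 2))
```

Under these assumptions, the single word "10" is nonoverlapping, while "010" overlaps itself in "01010", and the pair ("10", "01") overlaps in "101".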
In the following, let be the number of words in an arbitrary position of , i.e.
[TABLE]
where and if else 0 for all . For , let
[TABLE]
where . Let be a probability on , i.e., for and . Set for . For example for all if for .
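As a concrete reading of these definitions, the following sketch counts occurrences of a word at every position of a sample (so overlapping occurrences are counted) and evaluates the probability of a word as a product of letter probabilities. Both function names and the dictionary `letter_prob` are assumptions for illustration, and the product form of the word probability is inferred from the i.i.d. setting, not copied from the stripped formula.

```python
from math import prod


def count_occurrences(w, x):
    """Number of positions i at which the word w occurs in the string x.
    Unlike str.count, overlapping occurrences are all counted."""
    return sum(1 for i in range(len(x) - len(w) + 1) if x[i:i + len(w)] == w)


def word_probability(w, letter_prob):
    """Probability of observing the word w at a fixed position under
    i.i.d. letters, i.e. the product of the letter probabilities."""
    return prod(letter_prob[a] for a in w)
```

For example, `count_occurrences("00", "0000")` counts three positions, and with uniform binary letters a word of length 3 has probability 1/8.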
Theorem 2.1
Let be a finite alphabet and a probability on . Let be -valued i.i.d. random variables from for . Let be nonoverlapping. Let
[TABLE]
Then
[TABLE]
Proof) We prove the theorem for . The proof for the general case is similar. The number of possible allocations such that and appear and times respectively without overlapping in the strings of length is
[TABLE]
This is because if we replace and with additional extra symbols and in the strings of length then the problem reduces to choosing s and s among the strings of length . Let
[TABLE]
The function is not the probability of s and s occurrences in the string, since we allow any letters in the remaining place except for s and s. Let be the probability that and appear and times, respectively. We have the following identity,
[TABLE]
Then
[TABLE]
We have
[TABLE]
and (1) . ∎
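The argument of the proof can be checked numerically in the simplest case of a single nonoverlapping word. The sketch below is not the paper's formula verbatim (the displays were lost in extraction); it assumes the standard inclusion-exclusion form in which the k-th binomial moment of the count is C(n - |w|k + k, k) P(w)^k, obtained by the symbol-replacement argument, and the probabilities follow by inverting the binomial moments. The function names and the test word are hypothetical.

```python
from itertools import product
from math import comb, prod


def formula_pmf(n, w, letter_prob):
    """P(N_w = s) for s = 0..floor(n/|w|), for one nonoverlapping word w,
    via inclusion-exclusion on binomial moments."""
    length = len(w)
    pw = prod(letter_prob[a] for a in w)
    kmax = n // length
    # E[C(N_w, k)]: ways to place k disjoint copies of w, times P(w)^k.
    moments = [comb(n - length * k + k, k) * pw ** k for k in range(kmax + 1)]
    return [
        sum((-1) ** (k - s) * comb(k, s) * moments[k] for k in range(s, kmax + 1))
        for s in range(kmax + 1)
    ]


def brute_force_pmf(n, w, letter_prob):
    """Same distribution by enumerating every string of length n."""
    length = len(w)
    pmf = [0.0] * (n // length + 1)
    for letters in product(letter_prob, repeat=n):
        x = "".join(letters)
        c = sum(1 for i in range(n - length + 1) if x[i:i + length] == w)
        pmf[c] += prod(letter_prob[a] for a in letters)
    return pmf
```

The formula needs a number of terms that depends only on n/|w|, whereas the enumeration visits every string of length n, so the brute-force version is only useful as a check for small n.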
Régnier & Szpankowski (1998) show expectation, variance, and central limit theorems for the occurrences of words. Rukhin & Volkovich (2008) study chi-squared tests with nonoverlapping words. We give all orders of moments for nonoverlapping words. Let for all . Then is the number of surjective functions from for all ; see Riordan (1958), p. 100, Problem 1. Let be the greatest integer less than or equal to .
Theorem 2.2
Let be nonoverlapping. Under the same assumptions as in Theorem 2.1,
[TABLE]
for all .
Proof) Let . We say that is the support of . are called disjoint if their supports are disjoint, where for . Since is nonoverlapping, we have
[TABLE]
Let for all . Then
[TABLE]
By (3), if and only if there is a disjoint set such that .
The number of possible combinations of disjoint is . If then there is no disjoint s. For each disjoint , the number of possible combinations of such that is . By (4), we have the theorem. ∎
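The moment argument can be illustrated for a single nonoverlapping word. The sketch below assumes the identity N^t = Σ_k S(t, k) C(N, k), where S(t, k) is the number of surjections from a t-element set onto a k-element set, together with the binomial moment C(n - |w|k + k, k) P(w)^k of a nonoverlapping word. The theorem's exact display was lost in extraction, so this is a consistent reconstruction of the single-word case and the names are hypothetical.

```python
from itertools import product
from math import comb, prod


def surjections(t, k):
    """Number of surjective functions from a t-set onto a k-set (t >= 1),
    by inclusion-exclusion over the missed targets."""
    return sum((-1) ** (k - j) * comb(k, j) * j ** t for j in range(k + 1))


def moment_formula(n, w, letter_prob, t):
    """E[N_w^t] for one nonoverlapping word w and integer t >= 1."""
    length = len(w)
    pw = prod(letter_prob[a] for a in w)
    kmax = min(t, n // length)  # at most floor(n/|w|) disjoint occurrences fit
    return sum(
        surjections(t, k) * comb(n - length * k + k, k) * pw ** k
        for k in range(1, kmax + 1)
    )


def moment_brute_force(n, w, letter_prob, t):
    """E[N_w^t] by enumerating every string of length n."""
    length = len(w)
    total = 0.0
    for letters in product(letter_prob, repeat=n):
        x = "".join(letters)
        c = sum(1 for i in range(n - length + 1) if x[i:i + length] == w)
        total += c ** t * prod(letter_prob[a] for a in letters)
    return total
```

For t = 1 the sum reduces to (n - |w| + 1) P(w), the usual expected number of occurrences.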
3 Explicit formulae for distributions of runs
First we show probability functions for increasing nonoverlapping words.
Let
[TABLE]
For example and . We write if is a prefix of and . For example . If and then for all .
Definition 3.1
Let
[TABLE]
where are increasing nonoverlapping words.
Theorem 3.2
Let be a finite alphabet and a probability on . Let be -valued i.i.d. finite valued random variables from for . Let be increasing nonoverlapping words and
[TABLE]
Then
[TABLE]
Proof) We show (7) for . The proof of the general case is similar. Observe that
[TABLE]
Then
[TABLE]
Next, set in (7). Then
[TABLE]
By setting in (10), we have
[TABLE]
Since
[TABLE]
is the coefficient of in . On the other hand, by expanding the left-hand-side of (11), we have
[TABLE]
and (8). ∎
Eq. (7) is an inclusion-exclusion principle for increasing nonoverlapping words.
To derive a universal formula for probability functions of runs, we introduce a statistic that represents various types of runs.
Definition 3.3
For , let
[TABLE]
where .
Example 3.4
Consider a run and let .
1. Let for . Then and (0-overlapping enumeration).
2. Let for . Then and (1-overlapping enumeration).
3. Let for and Then and (2-overlapping enumeration).
4. Let . Then and .
When , the difference between and is that count for from the beginning of while does not.
Lemma 3.5
Let be i.i.d. binary random variables from and for all . Let and for . Then for all ,
[TABLE]
Proof) Observe that
[TABLE]
We have
[TABLE]
∎
Theorem 3.6 (main theorem)
Let be i.i.d. binary random variables from and for all . Let and for . Then for all ,
[TABLE]
Proof) Part 1 follows from Theorem 3.2 and Lemma 3.5. Part 2 follows from part 1. Part 3 follows from . Part 4 follows from .
Proof of part 5. Let , , and in Theorem 3.2. By (7), we have
[TABLE]
Set and . We have
[TABLE]
where , the number of of size exactly in .
On the other hand,
[TABLE]
Since , from (14) and (15), we have
[TABLE]
In a manner similar to Lemma 3.5, we have part 5. ∎
Remark 3.7
It is straightforward to extend i.i.d. binary valued random variables in Theorem 3.6 to those of countable values. Let , be a sequence of non-negative reals such that and be i.i.d. trials from for all . Let be binary i.i.d. trials from and for all . Then for all .
4 Distance of distributions
We show that and uniformly converge to and as , respectively.
Proposition 4.1
Let be i.i.d. binary random variables from . Assume . Then
[TABLE]
Proof) Assume that . By (5), if . Then for all ,
[TABLE]
Let for all . By Theorem 3.6, for all ,
[TABLE]
where the last inequality follows from (16) and . ∎
Assume that are i.i.d. binary random variables from . Let
[TABLE]
Table 1 shows numerical calculations of for , and for . Figure 1 shows graphs of for , and .
5 Algorithm and computational complexity
We study an algorithm for computing (8) and its computational complexity. The basic idea of our algorithm is similar to that of bucket sort (Cormen et al., 2009). When is negligible for some , it suffices to compute for . The following Algorithm A computes for all .
Let and
[TABLE]
Algorithm A
1. Initialize for all .
2. Enumerate all nonnegative vectors .
- For each vector , set
[TABLE]
where .
3. Output for all .
Since Algorithm A enumerates all combinations of for given in (8), Algorithm A correctly computes for all .
The bottleneck of the computational complexity of Algorithm A is the size of . In Theorem 5.2, we give an upper bound of the size of . The algorithm and computational complexity for computing are similar to those of .
First we prove a lemma. Let be the number of the elements of a finite set and
[TABLE]
Lemma 5.1
For all ,
[TABLE]
Proof) We prove the lemma by induction on . If , and (18) is true. Assume that (18) is true for some . Let for . Since
[TABLE]
we have , where . Let be the least integer that is greater than or equal to . Since is convex, we have
[TABLE]
and (18) is true for . By induction, we have the lemma. ∎
Theorem 5.2
Let be the sample size. For given and ,
1.
[TABLE]
2. Fix and . Then
[TABLE]
3. Fix . Then
[TABLE]
In particular, if and , then
[TABLE]
4. Let be positive constants, , and . Then
[TABLE]
In particular, and then for all and ,
[TABLE]
Proof) Let . Then
[TABLE]
By Lemma 5.1, the number of that satisfy (19) is less than or equal to . By (17), we have , and the number of such that is less than or equal to . By (17), the number of possible such that for each fixed is less than or equal to , and we have part 1.
Part 2 and 3 follow from part 1.
Proof of part 4. Let . Then
[TABLE]
where (20) follows from , and (21) follows from Stirling's formula . Let . By (21), we have . By part 1 we have part 4. ∎
Remark 5.3
In part 4 of Theorem 5.2, if then . On the other hand, to compute exact distributions by the Markov imbedding method, we need to calculate for sample size and matrix with . The number of arithmetic operations to compute is and that of is with the Strassen algorithm (Cormen et al., 2009).
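For comparison, the Markov-chain approach mentioned in this remark can be sketched for one statistic, the nonoverlapping count of the run of m zeros. This is not the paper's Algorithm A and not the matrix-power formulation of Fu & Koutras (1994); it is a minimal forward recursion over the same kind of state space (number of completed runs, current number of trailing zeros), written under the assumption that each symbol is 0 with probability `p_zero`. The function name and parameter names are hypothetical.

```python
def nonoverlapping_run_pmf(n, m, p_zero):
    """Distribution of the nonoverlapping count of 0^m in n i.i.d. binary
    trials, returned as a list indexed by the count 0..floor(n/m)."""
    kmax = n // m
    # state[c][r]: probability of c completed runs and r trailing zeros
    # that have not yet been absorbed into a completed run (0 <= r < m).
    state = [[0.0] * m for _ in range(kmax + 1)]
    state[0][0] = 1.0
    for _ in range(n):
        new = [[0.0] * m for _ in range(kmax + 1)]
        for c in range(kmax + 1):
            for r in range(m):
                pr = state[c][r]
                if pr == 0.0:
                    continue
                new[c][0] += pr * (1.0 - p_zero)   # a 1 resets the partial run
                if r + 1 == m:
                    new[c + 1][0] += pr * p_zero   # a 0 completes a run of size m
                else:
                    new[c][r + 1] += pr * p_zero   # a 0 extends the partial run
        state = new
    return [sum(row) for row in state]
```

This recursion touches every state at every step, so its cost grows with n times the number of states, which is the kind of dependence on the state-space size that the remark contrasts with the explicit formula.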
Acknowledgement
This work was supported by the Research Institute for Mathematical Sciences, an International Joint Usage/Research Center located in Kyoto University. The author thanks Prof. Shigeki Akiyama (Tsukuba Univ.) for discussions.
- Aki, S. & Hirano, K. (2000), ‘Numbers of success-runs of specified length until certain stopping time rules and generalized binomial distributions of order k’, Ann. Inst. Statist. Math. 52(4), 767–777.
- Aki, S., Kuboki, H. & Hirano, K. (1984), ‘On discrete distributions of order k’, Ann. Inst. Statist. Math. 36, 431–440.
- Antzoulakos, D. L. & Chadjiconstantindis, S. (2001), ‘Distributions of numbers of success runs of fixed length in Markov dependent trials’, Ann. Inst. Statist. Math. 53(3), 599–619.
- Balakrishnan, N. & Koutras, M. V. (2002), Runs and scans with applications, John Wiley & Sons.
- Bassino, F., Clément, J. & Nicodème, P. (2010), ‘Counting occurrences for a finite set of words: combinatorial methods’, ACM Trans. Algorithms. 9(4), Article No. 31.
- Berthé, V. & Rigo, M. (2016), Combinatorics, words and symbolic dynamics, Encyclopedia of Mathematics and Its Applications 159, Cambridge University Press.
- Blom, G. & Thorburn, D. (1982), ‘How many random digits are required until given sequences are obtained?’, J. Appl. Probab. 19(3), 518–531.
