Maximal Unbordered Factors of Random Strings
Patrick Hagge Cording, Travis Gagie, Mathias Bæk Tejs Knudsen, Tomasz Kociumaka

TL;DR
This paper proves that the expected maximum length of unbordered factors in a random string is close to the string length, confirming a conjecture and enabling linear-time average-case algorithms.
Contribution
It confirms a conjecture by precisely characterizing the expected maximum unbordered factor length in random strings and analyzes the average-case complexity of finding such factors.
Findings
Expected maximum unbordered factor length is n - Θ(σ^{-1})
Maximum unbordered factor can be found in linear time on average
Average-case complexity is between Ω(√n) and O(√(n log_σ n))
A preliminary version of this paper [3] with weaker results was presented at the 23rd Symposium on String Processing and Information Retrieval (SPIRE '16).
Patrick Hagge Cording, DTU Compute, Technical University of Denmark. Supported by the Danish Research Council under the Sapere Aude Program (DFF 4005-00267).
Travis Gagie, CeBiB; EIT, Universidad Diego Portales, Chile. Supported by FONDECYT grant 1171058.
Mathias Bæk Tejs Knudsen, Department of Computer Science, University of Copenhagen, Denmark. Partly supported by Mikkel Thorup's Advanced Grant from the Danish Council for Independent Research under the Sapere Aude research career programme and the FNU project AlgoDisc (Discrete Mathematics, Algorithms, and Data Structures).
Tomasz Kociumaka, Institute of Informatics, University of Warsaw, Poland.
Abstract
A border of a string is a non-empty prefix of the string that is also a suffix of the string, and a string is unbordered if it has no border other than itself. Loptev, Kucherov, and Starikovskaya [CPM 2015] conjectured the following: If we pick a string of length $n$ from a fixed non-unary alphabet uniformly at random, then the expected maximum length of its unbordered factors is $n - O(1)$. We confirm this conjecture by proving that the expected value is, in fact, $n - \Theta(\sigma^{-1})$, where $\sigma$ is the size of the alphabet. This immediately implies that we can find such a maximal unbordered factor in linear time on average. However, we go further and show that the optimum average-case running time is in $\Omega(\sqrt{n}) \cap O(\sqrt{n \log_\sigma n})$ due to analogous bounds by Czumaj and Gąsieniec [CPM 2000] for the problem of computing the shortest period of a uniformly random string.
1 Introduction
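Let $\Sigma$ be a finite alphabet of size $\sigma$. A string $S \in \Sigma^n$ is a sequence of $n$ symbols from $\Sigma$; the length of $S$ is denoted by $|S|$. For $1 \le i \le j \le n$, we denote $S[i,j] = S[i] \cdots S[j]$ and call the string $S[i,j]$ a factor of $S$. A factor $S[1,j]$ is a prefix of $S$ and a factor $S[i,n]$ is a suffix of $S$. A border of a string is a non-empty prefix of the string that is also a suffix of the string. In other words, the string $S$ has a border of length $\ell$, $1 \le \ell \le n$, if and only if $S[1,\ell] = S[n-\ell+1,n]$.
A string $S$ is unbordered if it does not have any proper border, i.e., any border other than the whole of $S$. By $L(S)$ we denote the maximum length of unbordered factors of $S$. Any unbordered factor of length $L(S)$ is called a maximal unbordered factor of $S$.
An integer $p$, $1 \le p \le |S|$, is a period of a string $S$ if $S[i] = S[i+p]$ for $1 \le i \le |S| - p$. The shortest period of a string $S$ is denoted $\operatorname{per}(S)$. Note that $p < |S|$ is a period of $S$ if and only if $S$ has a border of length $|S| - p$, so $S$ is unbordered if and only if $\operatorname{per}(S) = |S|$. Moreover, $\operatorname{per}(S[i,j]) \le \operatorname{per}(S)$ for every factor $S[i,j]$ of $S$; applied to a maximal unbordered factor, this yields $L(S) \le \operatorname{per}(S)$.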
Example 1 ([1]).
If , then and . The maximal unbordered factors are and .
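The definitions above can be checked on small inputs with a short Python sketch. It is only an illustration under our own assumptions: the function names are hypothetical, the border table is the standard prefix-function computation, and `max_unbordered_length` is a quadratic brute force rather than any of the algorithms cited in this paper.

```python
def border_table(s):
    """b[i] is the length of the longest proper border of s[:i+1]."""
    b = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = b[k - 1]
        if s[i] == s[k]:
            k += 1
        b[i] = k
    return b


def shortest_period(s):
    """per(s): |s| minus the length of the longest proper border (s non-empty)."""
    return len(s) - border_table(s)[-1]


def is_unbordered(s):
    """A string is unbordered exactly when per(s) = |s|."""
    return shortest_period(s) == len(s)


def max_unbordered_length(s):
    """L(s) by brute force: a prefix of s[i:] is unbordered iff its table entry is 0."""
    best = 0
    for i in range(len(s)):
        for j, bj in enumerate(border_table(s[i:])):
            if bj == 0:
                best = max(best, j + 1)
    return best
```

For instance, for the string `abaab` (our own example, not the one from [1]) both `shortest_period` and `max_unbordered_length` return 3, which is consistent with the inequality between the maximum unbordered factor length and the shortest period.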
Unbordered factors were first studied by Ehrenfeucht and Silberger [6], with emphasis on the relationship and . The question when received more attention in the literature [1, 5, 9, 8]. For strings , the equality holds if [9] or [6].
Loptev, Kucherov, and Starikovskaya [15] proved that for uniformly random string over an alphabet of size the expected maximum length of unbordered factors is at least , where converges to as grows. When and is sufficiently large, their bound implies . Supported by experimental results, Loptev et al. [15] conjectured that . In Section 2, we confirm this conjecture and prove that the tail of decays exponentially.
Theorem 2.
Let $S \in \Sigma^n$ be a uniformly random string over an alphabet $\Sigma$ of size $\sigma$.
- (a) $\mathbb{E}[L(S)] = n - O(\sigma^{-1})$.
- (b) For each , the probability of is at least .
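A quick way to get a feeling for the claim about the expected value is a Monte Carlo experiment. The sketch below is only an illustration under our own assumptions: the helper names, the sample sizes, and the seed are hypothetical, and the brute-force computation of the maximum unbordered factor length is quadratic, so it is suitable for short strings only.

```python
import random


def border_table(s):
    """b[i] is the length of the longest proper border of s[:i+1]."""
    b = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = b[k - 1]
        if s[i] == s[k]:
            k += 1
        b[i] = k
    return b


def max_unbordered_length(s):
    """Maximum length of an unbordered factor of s, by brute force."""
    best = 0
    for i in range(len(s)):
        for j, bj in enumerate(border_table(s[i:])):
            if bj == 0:
                best = max(best, j + 1)
    return best


def mean_deficit(n, sigma, trials, seed=0):
    """Empirical mean of n - L(S) over `trials` uniformly random strings of length n."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        s = [rng.randrange(sigma) for _ in range(n)]
        total += n - max_unbordered_length(s)
    return total / trials


if __name__ == "__main__":
    # The theorem suggests that the printed averages stay bounded as n grows
    # and shrink as the alphabet size sigma grows.
    for sigma in (2, 4, 16):
        print(sigma, mean_deficit(n=60, sigma=sigma, trials=200))
```

Such an experiment does not prove anything, but it makes it easy to compare the empirical average of $n - L(S)$ across several values of $n$ and $\sigma$.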
One can easily deduce that also satisfies both claims of Theorem 2. However, a recent study by Holub and Shallit [10] provides much stronger results concerning the shortest periods of uniformly random strings.
The problem of computing a maximal unbordered factor of a uniformly random string was studied by Loptev et al. [15] and Gawrychowski et al. [7], who gave algorithms with average-case running times of and , respectively. The solution by Loptev et al. [15, Theorem 3] actually takes worst-case time. By Theorem 2(a), its average-case running time is therefore . Nevertheless, this is still much worse than what is necessary to compute the shortest period of a uniformly random string [4]. To address this issue, in Section 3 we develop a pair of reductions using Theorem 2(b) to show that computing and is equivalent with respect to the average-case running time.
Theorem 3.
Let $S \in \Sigma^n$ be a uniformly random string over an alphabet $\Sigma$ of size $\sigma$.
- (a) The problem of computing $L(S)$ can be reduced in expected time to the problem of computing the shortest period of a fixed factor of $S$.
- (b) The problem of computing $\operatorname{per}(S)$ can be reduced in $O(1)$ expected time to the problem of computing $L(S)$.
Consequently, the $\Omega(\sqrt{n})$ and $O(\sqrt{n \log_\sigma n})$ lower and upper bounds known for computing the shortest period of a uniformly random string, both due to Czumaj and Gąsieniec [4], carry over to computing a maximal unbordered factor of such a string.
Corollary 4.
The problem of computing a maximal unbordered factor of a uniformly random string $S \in \Sigma^n$ over an alphabet of size $\sigma$ takes $O(\sqrt{n \log_\sigma n})$ time on average, and this bound is within an $O(\sqrt{\log_\sigma n})$ factor of optimal.
Czumaj and Gąsieniec also conjectured that the optimum average-case running time of computing the shortest period is ; any resolution of this conjecture automatically transfers to maximal unbordered factors.
The worst-case running time we get from Theorem 3 and Czumaj and Gąsieniec's work [4] is . However, to obtain state-of-the-art running time both in the average case and in the worst case, we can dovetail our solution with any of the worst-case algorithms for computing a maximal unbordered factor. Gawrychowski et al. [7] gave such an algorithm with the running time . Very recently, this has been improved [12] to (and further to if one allows Las Vegas randomization). Nevertheless, this is still slower than the time needed to compute the shortest period in the worst case [16, 11].
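Dovetailing means running two algorithms in an interleaved fashion and stopping as soon as either finishes, so the total work is at most about twice that of the faster one on each input. The Python sketch below illustrates the idea with generators. It is a toy under our own assumptions: the two routines compute the shortest period of a non-empty string (a stand-in chosen for brevity, not the maximal unbordered factor algorithms discussed here), and all names are hypothetical.

```python
def period_naive(s):
    """Try p = 1, 2, ... and yield once per symbol comparison."""
    n = len(s)
    for p in range(1, n + 1):
        ok = True
        for i in range(n - p):
            yield  # one unit of work
            if s[i] != s[i + p]:
                ok = False
                break
        if ok:
            return p


def period_border_table(s):
    """Border-table computation; yield once per loop iteration."""
    n = len(s)
    b = [0] * n
    k = 0
    for i in range(1, n):
        while k > 0 and s[i] != s[k]:
            k = b[k - 1]
            yield
        if s[i] == s[k]:
            k += 1
        b[i] = k
        yield
    return n - b[-1]


def dovetail(*computations):
    """Advance all computations round-robin; return the first result produced."""
    if not computations:
        raise ValueError("nothing to run")
    while True:
        for gen in computations:
            try:
                next(gen)
            except StopIteration as done:
                return done.value


if __name__ == "__main__":
    s = "abaab"
    print(dovetail(period_naive(s), period_border_table(s)))
```

Each `yield` marks one unit of work, so the round-robin loop gives both computations the same budget at every point in time.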
Data structures for answering period queries have also recently been developed. Such a query takes two indices $i$ and $j$, and the answer is the shortest period $\operatorname{per}(S[i,j])$. Kociumaka et al. [14] developed a data structure of size answering period queries in time, improving upon several time-space trade-offs from their earlier paper [13]. Computing $L(S[i,j])$ for a given factor appears to be a much more difficult task.
Another interesting possibility is to extend our results from average-case analysis to smoothed analysis [17, 18, 2], in which the input can be chosen adversarially but some random noise is then added to it. We conjecture that when the noise level is reasonably large — e.g., each symbol is replaced by a randomly chosen one with some positive constant probability — then our bounds do not change significantly. Our results or techniques could also be applicable to other problems concerning borders and periods.
2 Distribution of Maximum Length of Unbordered Factors
Let us fix an alphabet of size . For every , we define a random variable distributed as for uniformly random . The following lemma, which gives a common upper bound of the moment-generating functions , is the key tool behind Theorem 2.
Lemma 5.
For and , we have , where
[TABLE]
Proof.
We proceed by induction on . The base case is for which and therefore . Consequently, we need to prove that
[TABLE]
Note that the denominator is a quadratic function of with a minimum at . Hence, for . The right-hand side is a polynomial of , and one can easily verify that it is positive for . Consequently, the denominator is positive. To complete the proof of the base case, observe that is also positive for .
For , we assume for and . We consider a uniformly random and condition over the possible lengths of the shortest border of . More formally, we define as the smallest integer such that , and we write
[TABLE]
Now, we bound from above individual terms of this sum. Observe that is equivalent to and therefore
[TABLE]
For , we observe that is independent from . Due to , this yields
[TABLE]
Moreover, we note that implies for and these events are independent. For , we have one more independent event due to . Consequently,
[TABLE]
In the remaining case of , we observe that if , then is also a border of . This contradicts because . Consequently,
[TABLE]
Plugging (3–6) into (2), we obtain
[TABLE]
The inductive assumption further yields
[TABLE]
This completes the proof of Lemma 5. ∎
Next, let us focus on the expected value . Note that . Consequently, for we have
[TABLE]
Hence, is bounded by a function of independent of . To analyze its asymptotics in terms of , we plug (valid for ), which yields
[TABLE]
This completes the proof of Theorem 2(a).
For claim (b), we apply Markov's inequality on top of Lemma 5:
[TABLE]
Hence, it suffices to take to make sure that the probability does not exceed . To complete the proof, observe that
[TABLE]
3 Average-Case Algorithms for Maximal Unbordered Factors
In this section, we give a pair of reductions between the problems of computing the shortest period and the maximum length of unbordered factors of a uniformly random string, thereby proving Theorem 3. We assume that the alphabet is of size $\sigma \ge 2$. Otherwise, both values are always 1.
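We start with a simple argument showing Theorem 3(b). Suppose that we aim at computing $\operatorname{per}(S)$ for a uniformly random string $S \in \Sigma^n$. Having determined $L(S)$, we rely on the fact that $\operatorname{per}(S) \ge L(S)$. We construct a string $S_\$ := S[1,n-L(S)]\,\$\,S[L(S)+1,n]$, where $\$ \notin \Sigma$. Every proper border of $S$ has length $\ell \le n-L(S)$ and is therefore also a border of $S_\$$. Conversely, since $\$$ occurs only once in $S_\$$, every proper border of $S_\$$ has length at most $n-L(S)$ and is also a border of $S$. Consequently, $|S|-\operatorname{per}(S)=|S_\$|-\operatorname{per}(S_\$)$. The value $\operatorname{per}(S_\$)$ can be computed using a worst-case algorithm [16, 11], which takes $O(|S_\$|)=O(n-L(S)+1)$ time; this is $O(1)$ in expectation due to Theorem 2(a).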
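The reduction of Theorem 3(b) can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions: the function names are hypothetical, `None` plays the role of the sentinel symbol outside the alphabet, and the brute-force `max_unbordered_length` merely stands in for whatever procedure supplies the maximum unbordered factor length.

```python
def border_table(s):
    """b[i] is the length of the longest proper border of s[:i+1]."""
    b = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = b[k - 1]
        if s[i] == s[k]:
            k += 1
        b[i] = k
    return b


def shortest_period(s):
    """Shortest period of a non-empty sequence, in time linear in its length."""
    return len(s) - border_table(s)[-1]


def max_unbordered_length(s):
    """Stand-in oracle: maximum unbordered factor length by brute force."""
    best = 0
    for i in range(len(s)):
        for j, bj in enumerate(border_table(s[i:])):
            if bj == 0:
                best = max(best, j + 1)
    return best


def period_via_unbordered(s, sentinel=None):
    """Shortest period of s from L(s) and the period of a short auxiliary string."""
    n = len(s)
    L = max_unbordered_length(s)
    # Prefix and suffix of length n - L, separated by a symbol outside the alphabet.
    s_dollar = list(s[:n - L]) + [sentinel] + list(s[L:])
    # Both strings have the same longest proper border length.
    longest_border = len(s_dollar) - shortest_period(s_dollar)
    return n - longest_border
```

The linear-time period computation is applied only to the auxiliary string of length $2(n-L(S))+1$, which is where the saving comes from when $L(S)$ is close to $n$.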
We proceed with a proof of Theorem 3(a). Suppose that we aim at computing for a uniformly random string . We apply Theorem 2(b) for to obtain a value such that for uniformly random strings of arbitrary length . Note that this also yields due to .
If , we simply determine using Loptev et al.'s algorithm [15], which takes time on average. Otherwise, we construct three strings
[TABLE]
and we compute , , and . If any of these values exceeds , we fall back to the algorithm of [15] to compute . Otherwise, we determine based on .
Before proving this equality, let us analyze the running time of the reduction. Observe that , , and are uniformly random strings of the respective lengths, which lets us use average-case algorithms. In particular, it takes time on average to compute using Loptev et al.'s algorithm [15]. Determining is the target of the reduction, so we do not include it in the analysis. The value is computed in worst-case time [16, 11]. The probability of a fall-back is at most by the choice of , which compensates for the worst-case time it takes to apply Loptev et al.'s algorithm to the whole of . (Note that we cannot use the average-case bound of because the conditional distribution of (in case of a fall-back) is no longer uniform across .) Overall, the reduction works in time on average.
It remains to prove provided that , , and . First, consider a maximal unbordered factor of . It must be of the form for some and , and we claim that is then an unbordered factor of . For a proof by contradiction, suppose that has a proper border and the longest such border is of length . Note that because is unbordered. We conclude that . However, this yields , a contradiction. Consequently, .
The proof of is symmetric. We consider a maximal unbordered factor of , observe that and due to , and claim that is unbordered. For a proof by contradiction, we suppose that it has a border of length . We note that because is unbordered and derive , which contradicts .
This completes the proof of Theorem 3(a).
Acknowledgments
Many thanks to Danny Hucke for asking about the possibility of a sublinear average-case algorithm at the presentation of the conference version of this paper, and to the anonymous reviewers for their comments.
References
- [1] Roland Assous and Maurice Pouzet. Une caractérisation des mots periodiques. Discrete Mathematics, 25(1):1–5, 1979. doi:10.1016/0012-365X(79)90146-8.
- [2] Christina Boucher and Kathleen Wilkie. Why large closest string instances are easy to solve in practice. In Edgar Chávez and Stefano Lonardi, editors, String Processing and Information Retrieval, SPIRE 2010, volume 6393 of LNCS, pages 106–117. Springer, 2010. doi:10.1007/978-3-642-16321-0_10.
- [3] Patrick Hagge Cording and Mathias Bæk Tejs Knudsen. Maximal unbordered factors of random strings. In Shunsuke Inenaga, Kunihiko Sadakane, and Tetsuya Sakai, editors, String Processing and Information Retrieval, SPIRE 2016, volume 9954 of LNCS, pages 93–96, 2016. doi:10.1007/978-3-319-46049-9_9.
- [4] Artur Czumaj and Leszek Gąsieniec. On the complexity of determining the period of a string. In Raffaele Giancarlo and David Sankoff, editors, Combinatorial Pattern Matching, CPM 2000, volume 1848 of LNCS, pages 412–422. Springer, 2000. doi:10.1007/3-540-45123-4_34.
- [5] Jean-Pierre Duval. Relationship between the period of a finite word and the length of its unbordered segments. Discrete Mathematics, 40(1):31–44, 1982. doi:10.1016/0012-365X(82)90186-8.
- [6] Andrzej Ehrenfeucht and D. M. Silberger. Periodicity and unbordered segments of words. Discrete Mathematics, 26(2):101–109, 1979. doi:10.1016/0012-365X(79)90116-X.
- [7] Paweł Gawrychowski, Gregory Kucherov, Benjamin Sach, and Tatiana Starikovskaya. Computing the longest unbordered substring. In Costas S. Iliopoulos, Simon J. Puglisi, and Emine Yilmaz, editors, String Processing and Information Retrieval, SPIRE 2015, volume 9309 of LNCS, pages 246–257. Springer, 2015. doi:10.1007/978-3-319-23826-5_24.
- [8] Tero Harju and Dirk Nowotka. Periodicity and unbordered words: A proof of the extended Duval conjecture. Journal of the ACM, 54(4):20, 2007. doi:10.1145/1255443.1255448.
