Maximal Unbordered Factors of Random Strings

Patrick Hagge Cording; Travis Gagie; Mathias Bæk Tejs Knudsen; Tomasz Kociumaka

arXiv:1704.04472·cs.DS·December 18, 2018


TL;DR

This paper proves that the expected maximum length of unbordered factors in a random string is close to the string length, confirming a conjecture and enabling linear-time average-case algorithms.

Contribution

It confirms a conjecture by precisely characterizing the expected maximum unbordered factor length in random strings and analyzes the average-case complexity of finding such factors.

Findings

01

Expected maximum unbordered factor length is n - Θ(σ^{-1})

02

Maximum unbordered factor can be found in linear time on average

03

Average-case complexity is between Ω(√n) and O(√(n log_σ n))

Abstract

A border of a string is a non-empty prefix of the string that is also a suffix of the string, and a string is unbordered if it has no border other than itself. Loptev, Kucherov, and Starikovskaya [CPM 2015] conjectured the following: If we pick a string of length $n$ from a fixed non-unary alphabet uniformly at random, then the expected maximum length of its unbordered factors is $n-O(1)$. We confirm this conjecture by proving that the expected value is, in fact, $n-\Theta(\sigma^{-1})$, where $\sigma$ is the size of the alphabet. This immediately implies that we can find such a maximal unbordered factor in linear time on average. However, we go further and show that the optimum average-case running time is in $\Omega(\sqrt{n})\cap O(\sqrt{n\log_{\sigma}n})$ due to analogous bounds by Czumaj and Gąsieniec [CPM 2000] for the problem of computing the shortest period of a…

Equations

$$C(t)=\frac{\sigma^{3}-\sigma^{2}e^{2t}}{\sigma^{3}-2\sigma^{2}e^{2t}+e^{4t}}.$$

$$C(t)-M_{\Delta_{n}}(t)=\frac{\sigma^{3}-\sigma^{2}e^{2t}}{\sigma^{3}-2\sigma^{2}e^{2t}+e^{4t}}-1=\frac{\sigma^{2}e^{2t}-e^{4t}}{\sigma^{3}-2\sigma^{2}e^{2t}+e^{4t}}\geq 0.$$

$$M_{\Delta_{n}}(t)=\operatorname{E}[e^{t(n-L(S))}]=\sum_{\ell=1}^{n}\operatorname{P}[F(S)=\ell]\cdot\operatorname{E}[e^{t(n-L(S))}\mid F(S)=\ell].$$

$$\operatorname{E}[e^{t(n-L(S))}\mid F(S)=n]=1.$$

$$\operatorname{E}[e^{t(n-L(S))}\mid F(S)=\ell]\leq\operatorname{E}[e^{t(n-L(S[\ell+1,n-\ell]))}\mid F(S)=\ell]=\operatorname{E}[e^{t(n-L(S[\ell+1,n-\ell]))}]=e^{2t\ell}\operatorname{E}[e^{t(n-2\ell-L(S[\ell+1,n-\ell]))}]=e^{2t\ell}M_{\Delta_{n-2\ell}}(t).$$

$$\operatorname{P}[F(S)=\ell]\leq\begin{cases}\sigma^{-1}&\text{if }\ell=1,\\(\sigma-1)\sigma^{-\ell-1}&\text{if }2\leq\ell\leq\frac{1}{2}n.\end{cases}$$

$$\operatorname{P}[F(S)=\ell]=0\quad\text{if }\tfrac{1}{2}n<\ell<n.$$

$$M_{\Delta_{n}}(t)\leq 1+\sigma^{-1}\cdot e^{2t}\cdot M_{\Delta_{n-2}}(t)+\sum_{\ell=2}^{\lfloor n/2\rfloor}(\sigma-1)\sigma^{-\ell-1}\cdot e^{2t\ell}\cdot M_{\Delta_{n-2\ell}}(t).$$

$$\begin{aligned}M_{\Delta_{n}}(t)&\leq 1+C(t)\left(\sigma^{-1}e^{2t}+(\sigma-1)\sigma^{-3}e^{4t}\cdot\sum_{\ell=0}^{\infty}(\sigma^{-1}e^{2t})^{\ell}\right)\\&=1+C(t)\left(\sigma^{-1}e^{2t}+(\sigma-1)\sigma^{-3}e^{4t}\cdot\frac{1}{1-\sigma^{-1}e^{2t}}\right)\\&=1+C(t)\cdot\frac{\sigma(\sigma-e^{2t})e^{2t}+(\sigma-1)e^{4t}}{\sigma^{2}(\sigma-e^{2t})}\\&=1+\frac{\sigma^{3}-\sigma^{2}e^{2t}}{\sigma^{3}-2\sigma^{2}e^{2t}+e^{4t}}\cdot\frac{\sigma^{2}e^{2t}-e^{4t}}{\sigma^{3}-\sigma^{2}e^{2t}}\\&=\frac{\sigma^{3}-2\sigma^{2}e^{2t}+e^{4t}+\sigma^{2}e^{2t}-e^{4t}}{\sigma^{3}-2\sigma^{2}e^{2t}+e^{4t}}\\&=C(t).\end{aligned}$$

$$\operatorname{E}[\Delta_{n}]\leq\frac{M_{\Delta_{n}}(t)-1}{t}\leq\frac{C(t)-1}{t}.$$

$$\operatorname{E}[\Delta_{n}]\leq C(1)-1=\frac{\sigma^{2}e^{2}-e^{4}}{\sigma^{3}-2\sigma^{2}e^{2}+e^{4}}=\frac{O(\sigma^{2})}{\Omega(\sigma^{3})}=O(\sigma^{-1}).$$

$$\operatorname{P}[\Delta_{n}\geq\ell]\leq\frac{\operatorname{E}[e^{t\Delta_{n}}]}{e^{t\ell}}=\frac{M_{\Delta_{n}}(t)}{e^{t\ell}}\leq\frac{C(t)}{e^{t\ell}}.$$

$$C(0.1\ln\sigma)=\frac{\sigma^{3}-\sigma^{2.2}}{\sigma^{3}-2\sigma^{2.2}+\sigma^{0.4}}=\frac{O(\sigma^{3})}{\Omega(\sigma^{3})}=O(1).$$

$\bar{S}$, $S^{\prime}$, $\bar{S}^{\prime}$


Full text

Maximal Unbordered Factors of Random Strings

A preliminary version of this paper [3] with weaker results was presented at the 23rd Symposium on String Processing and Information Retrieval (SPIRE '16).

Patrick Hagge Cording, DTU Compute, Technical University of Denmark. Supported by the Danish Research Council under the Sapere Aude Program (DFF 4005-00267).

Travis Gagie, CeBiB; EIT, Universidad Diego Portales, Chile. Supported by FONDECYT grant 1171058.

Mathias Bæk Tejs Knudsen, Department of Computer Science, University of Copenhagen, Denmark. Partly supported by Mikkel Thorup's Advanced Grant from the Danish Council for Independent Research under the Sapere Aude research career programme and the FNU project AlgoDisc (Discrete Mathematics, Algorithms, and Data Structures).

Tomasz Kociumaka, Institute of Informatics, University of Warsaw, Poland.

Abstract

A border of a string is a non-empty prefix of the string that is also a suffix of the string, and a string is unbordered if it has no border other than itself. Loptev, Kucherov, and Starikovskaya [CPM 2015] conjectured the following: If we pick a string of length $n$ from a fixed non-unary alphabet uniformly at random, then the expected maximum length of its unbordered factors is $n-O(1)$ . We confirm this conjecture by proving that the expected value is, in fact, ${n-\Theta(\sigma^{-1})}$ , where $\sigma$ is the size of the alphabet. This immediately implies that we can find such a maximal unbordered factor in linear time on average. However, we go further and show that the optimum average-case running time is in $\Omega(\sqrt{n})\cap O(\sqrt{n\log_{\sigma}n})$ due to analogous bounds by Czumaj and Gąsieniec [CPM 2000] for the problem of computing the shortest period of a uniformly random string.

1 Introduction

Let $\Sigma$ be a finite alphabet of size $\sigma\geq 2$ . A string $S\in\Sigma^{n}$ is a sequence $S=S[1]\cdots S[n]$ of $n$ symbols from $\Sigma$ ; the length $n$ of $S$ is denoted by $|S|$ . For $1\leq i\leq j\leq n$ , we denote $S[i,j]=S[i]\cdots S[j]$ and call the string $S[i,j]$ a factor of $S$ . A factor $S[1,j]$ is a prefix of $S$ and a factor $S[i,n]$ is a suffix of $S$ . A border of a string is a non-empty prefix of the string that is also a suffix of the string. In other words, the string $S$ has a border of length $\ell$ , $1\leq\ell\leq n$ , if and only if $S[1,\ell]=S[n-\ell+1,n]$ .

A string $S$ is unbordered if it does not have any proper border, i.e., any border other than the whole of $S$ . By $L(S)$ we denote the maximum length of unbordered factors of $S$ . Any unbordered factor of length $L(S)$ is called a maximal unbordered factor of $S$ .

An integer $p>0$ is a period of a string $S\in\Sigma^{n}$ if $S[i]=S[i+p]$ for $1\leq i\leq n-p$ . The shortest period of a string $S$ is denoted $\operatorname{per}(S)$ . Note that $p$ is a period of $S$ if and only if $S$ has a border of length $n-p$ , so $S$ is unbordered if and only if $\operatorname{per}(S)=n$ . Moreover, $\operatorname{per}(S[i,j])\leq\operatorname{per}(S)$ ; applied to a maximal unbordered factor, this yields $L(S)\leq\operatorname{per}(S)$ .

Example 1 ([1]).

If $S=\texttt{1011001101}$ , then $\operatorname{per}(S)=7$ and $L(S)=6$ . The maximal unbordered factors are $S[1,6]=\texttt{101100}$ and $S[5,10]=\texttt{001101}$ .
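The definitions of border, period, and $L(S)$ can be checked by brute force on Example 1. The following Python sketch is illustrative only: the helper names (`is_unbordered`, `shortest_period`, `maximal_unbordered_factors`) are not from the paper, and the exhaustive search is chosen for clarity rather than efficiency.

```python
def is_unbordered(s: str) -> bool:
    """True if s has no proper border (a border shorter than s itself)."""
    return all(s[:k] != s[-k:] for k in range(1, len(s)))


def shortest_period(s: str) -> int:
    """per(s): the smallest p > 0 with s[i] == s[i + p] wherever both exist."""
    n = len(s)
    for p in range(1, n + 1):
        if all(s[i] == s[i + p] for i in range(n - p)):
            return p
    return 0  # only reached for the empty string


def maximal_unbordered_factors(s: str) -> list[tuple[int, int]]:
    """1-based (i, j) pairs of all unbordered factors of maximum length L(s)."""
    n = len(s)
    for length in range(n, 0, -1):
        hits = [(i + 1, i + length)
                for i in range(n - length + 1)
                if is_unbordered(s[i:i + length])]
        if hits:
            return hits
    return []


S = "1011001101"
print(shortest_period(S))              # 7
print(maximal_unbordered_factors(S))   # [(1, 6), (5, 10)]
```

The two reported positions correspond to $S[1,6]$ and $S[5,10]$, both of length $L(S)=6$.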

Unbordered factors were first studied by Ehrenfeucht and Silberger [6], with emphasis on the relationship between $\operatorname{per}(S)$ and $L(S)$. The question of when $\operatorname{per}(S)=L(S)$ received more attention in the literature [1, 5, 9, 8]. For strings $S\in\Sigma^{n}$, the equality holds if $L(S)\leq\frac{3}{7}n$ [9] or $\operatorname{per}(S)\leq\frac{1}{2}n$ [6].

Loptev, Kucherov, and Starikovskaya [15] proved that for a uniformly random string $S\in\Sigma^{n}$ over an alphabet $\Sigma$ of size $\sigma\geq 2$ the expected maximum length $\operatorname{E}[L(S)]$ of unbordered factors is at least $n(1-\xi(\sigma)\cdot\sigma^{-4})+O(1)$, where $\xi(\sigma)$ converges to $2$ as $\sigma$ grows. When $\sigma\geq 5$ and $n$ is sufficiently large, their bound implies $\operatorname{E}[L(S)]\geq 0.99n$. Supported by experimental results, Loptev et al. [15] conjectured that $\operatorname{E}[L(S)]=n-O(1)$. In Section 2, we confirm this conjecture and prove that the tail of $n-L(S)$ decays exponentially.

Theorem 2.

Let $S\in\Sigma^{n}$ be a uniformly random string over an alphabet $\Sigma$ of size $\sigma\geq 2$ .

(a) $\operatorname{E}[L(S)]=n-O(\sigma^{-1})$.

(b) For each $\delta>0$, the probability of $L(S)=n-O(\log_{\sigma}\delta^{-1})$ is at least $1-\delta$.

One can easily deduce that $\operatorname{per}(S)\geq L(S)$ also satisfies both claims of Theorem 2. However, a recent study by Holub and Shallit [10] provides much stronger results concerning the shortest periods of uniformly random strings.

The problem of computing a maximal unbordered factor of a uniformly random string was studied by Loptev et al. [15] and Gawrychowski et al. [7], who gave algorithms with average-case running times of $O(\frac{n^{2}}{\sigma^{4}}+n)$ and $O(n\log n)$ , respectively. The solution by Loptev et al. [15, Theorem 3] actually takes $O(n(n-L(S)+1))$ worst-case time. By Theorem 2(a), its average-case running time is therefore $O(n)$ . Nevertheless, this is still much worse than what is necessary to compute the shortest period of a uniformly random string [4]. To address this issue, in Section 3 we develop a pair of reductions using Theorem 2(b) to show that computing $L(S)$ and $\operatorname{per}(S)$ is equivalent with respect to the average-case running time.
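Spelled out, the $O(n)$ average-case bound follows from linearity of expectation applied to the worst-case bound $O(n(n-L(S)+1))$, with Theorem 2(a) supplying $\operatorname{E}[n-L(S)]=O(\sigma^{-1})$:

```latex
\operatorname{E}\bigl[\,n\,(n-L(S)+1)\,\bigr]
  = n\,\bigl(\operatorname{E}[\,n-L(S)\,]+1\bigr)
  = n\,\bigl(O(\sigma^{-1})+1\bigr)
  = O(n).
```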

Theorem 3.

Let $S\in\Sigma^{n}$ be a uniformly random string over an alphabet $\Sigma$ of size $\sigma$ .

(a) The problem of computing $L(S)$ can be reduced in $O(\log_{\sigma}n)$ expected time to the problem of computing $\operatorname{per}(S^{\prime})$ for a fixed factor $S^{\prime}$ of $S$.

(b) The problem of computing $\operatorname{per}(S)$ can be reduced in $O(1)$ expected time to the problem of computing $L(S)$.

Consequently, the $\Omega(\sqrt{n})$ and $O(\sqrt{n\log_{\sigma}n})$ lower and upper bounds known for computing the shortest period of a uniformly random string, both due to Czumaj and Gąsieniec [4], carry over to computing a maximal unbordered factor of such a string.

Corollary 4.

The problem of computing a maximal unbordered factor of a uniformly random string over an alphabet $\Sigma$ of size $\sigma$ takes $O(\sqrt{n\log_{\sigma}n})$ time on average, and this bound is within an $O(\sqrt{\log_{\sigma}n})$ factor of optimal.

Czumaj and Gąsieniec also conjectured that the optimum average-case running time of computing the shortest period is $\Theta(\sqrt{n\log_{\sigma}n})$ ; any resolution of this conjecture automatically transfers to maximal unbordered factors.

The worst-case running time we get from Theorem 3 and Czumaj and Gąsieniec's work [4] is $O(n^{2})$. However, to obtain state-of-the-art running time both in the average case and in the worst case, we can dovetail our solution with any of the worst-case algorithms for computing a maximal unbordered factor. Gawrychowski et al. [7] gave such an algorithm with running time $O(n^{1.5})$. Very recently, this has been improved [12] to $O(n\log n\log^{2}\log n)$ (and further to $O(n\log n)$ if one allows Las Vegas randomization). Nevertheless, this is still slower than the $O(n)$ time needed to compute the shortest period in the worst case [16, 11].

Data structures for answering period queries have also recently been developed. Such a query takes two indices $i$ and $j$, and the answer is the shortest period $\operatorname{per}(S[i,j])$. Kociumaka et al. [14] developed a data structure of size $O(n)$ answering period queries in $O(\log n)$ time, which improved upon several time-space trade-offs they presented in an earlier paper [13]. Computing $L(S[i,j])$ for a given factor $S[i,j]$ appears to be a much more difficult task.

Another interesting possibility is to extend our results from average-case analysis to smoothed analysis [17, 18, 2], in which the input can be chosen adversarially but some random noise is then added to it. We conjecture that when the noise level is reasonably large — e.g., each symbol is replaced by a randomly chosen one with some positive constant probability — then our bounds do not change significantly. Our results or techniques could also be applicable to other problems concerning borders and periods.

2 Distribution of Maximum Length of Unbordered Factors

Let us fix an alphabet $\Sigma$ of size $\sigma\geq 2$ . For every $n\geq 0$ , we define a random variable $\Delta_{n}$ distributed as $|S|-L(S)$ for uniformly random $S\in\Sigma^{n}$ . The following lemma, which gives a common upper bound of the moment-generating functions $M_{\Delta_{n}}(t)=\operatorname{E}[e^{t\Delta_{n}}]$ , is the key tool behind Theorem 2.

Lemma 5.

For $n\in\mathbb{N}$ and $0\leq t\leq 0.1\ln\sigma$ , we have $M_{\Delta_{n}}(t)\leq C(t)$ , where

$$C(t)=\frac{\sigma^{3}-\sigma^{2}e^{2t}}{\sigma^{3}-2\sigma^{2}e^{2t}+e^{4t}}.$$

Proof.

We proceed by induction on $n$ . The base case is $n\in\{0,1\}$ for which $\Delta_{n}=0$ and therefore $M_{\Delta_{n}}(t)=1$ . Consequently, we need to prove that

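$$C(t)-M_{\Delta_{n}}(t)=\frac{\sigma^{3}-\sigma^{2}e^{2t}}{\sigma^{3}-2\sigma^{2}e^{2t}+e^{4t}}-1=\frac{\sigma^{2}e^{2t}-e^{4t}}{\sigma^{3}-2\sigma^{2}e^{2t}+e^{4t}}\geq 0.$$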

Note that the denominator is a quadratic function of $e^{2t}$ with a minimum at $e^{2t}=\sigma^{2}$. Hence, $\sigma^{3}-2\sigma^{2}e^{2t}+e^{4t}\geq\sigma^{3}-2\sigma^{2.2}+\sigma^{0.4}$ for $t\leq 0.1\ln\sigma$. The right-hand side is a polynomial in $\sigma^{0.2}$, and one can easily verify that it is positive for $\sigma\geq 2$. Consequently, the denominator is positive. To complete the proof of the base case, observe that the numerator $\sigma^{2}e^{2t}-e^{4t}=e^{2t}(\sigma^{2}-e^{2t})$ is also positive for $t<\ln\sigma$.
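The positivity claim for $\sigma^{3}-2\sigma^{2.2}+\sigma^{0.4}$ is tight enough at $\sigma=2$ that a numeric check is reassuring. The sketch below is a sanity check rather than a proof; the function names are illustrative, and the grid over $t$ is an arbitrary choice.

```python
import math


def denominator_lower_bound(sigma: int) -> float:
    """sigma^3 - 2*sigma^2.2 + sigma^0.4: the denominator of C(t) at t = 0.1 ln(sigma)."""
    return sigma ** 3 - 2 * sigma ** 2.2 + sigma ** 0.4


def C(t: float, sigma: int) -> float:
    """The bound C(t) from Lemma 5, written in terms of x = e^{2t}."""
    x = math.exp(2 * t)
    return (sigma ** 3 - sigma ** 2 * x) / (sigma ** 3 - 2 * sigma ** 2 * x + x * x)


for sigma in (2, 3, 4, 16, 256):
    t_max = 0.1 * math.log(sigma)
    grid = [t_max * k / 100 for k in range(101)]
    assert denominator_lower_bound(sigma) > 0
    # Base case of the induction: M(t) = 1 <= C(t) on the whole admissible range.
    assert all(C(t, sigma) >= 1 for t in grid)
    print(sigma, round(denominator_lower_bound(sigma), 4))
```

At $\sigma=2$ the margin is only about $0.13$; for integer $\sigma\geq 3$ positivity already follows from $\sigma^{0.8}>2$.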

For $n\geq 2$ , we assume $M_{\Delta_{m}}(t)\leq C(t)$ for $m<n$ and $0\leq t\leq 0.1\ln\sigma$ . We consider a uniformly random $S\in\Sigma^{n}$ and condition over the possible lengths $\ell$ of the shortest border of $S$ . More formally, we define $F(S)$ as the smallest integer $\ell>0$ such that $S[1,\ell]=S[n-\ell+1,n]$ , and we write

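$$M_{\Delta_{n}}(t)=\operatorname{E}[e^{t(n-L(S))}]=\sum_{\ell=1}^{n}\operatorname{P}[F(S)=\ell]\cdot\operatorname{E}[e^{t(n-L(S))}\mid F(S)=\ell].\tag{2}$$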

Now, we bound from above individual terms of this sum. Observe that $F(S)=n$ is equivalent to $L(S)=n$ and therefore

$$\operatorname{E}[e^{t(n-L(S))}\mid F(S)=n]=1.\tag{3}$$

For $\ell\leq\frac{1}{2}n$ , we observe that $S[\ell+1,n-\ell]$ is independent from $F(S)=\ell$ . Due to $L(S)\geq L(S[\ell+1,n-\ell])$ , this yields

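$$\begin{aligned}\operatorname{E}[e^{t(n-L(S))}\mid F(S)=\ell]&\leq\operatorname{E}[e^{t(n-L(S[\ell+1,n-\ell]))}\mid F(S)=\ell]=\operatorname{E}[e^{t(n-L(S[\ell+1,n-\ell]))}]\\&=e^{2t\ell}\operatorname{E}[e^{t(n-2\ell-L(S[\ell+1,n-\ell]))}]=e^{2t\ell}M_{\Delta_{n-2\ell}}(t).\end{aligned}\tag{4}$$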

Moreover, we note that $F(S)=\ell$ implies $S[i]=S[n-\ell+i]$ for $1\leq i\leq\ell$ and these events are independent. For $\ell\geq 2$ , we have one more independent event $S[1]\neq S[\ell]$ due to $F(S)\neq 1$ . Consequently,

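$$\operatorname{P}[F(S)=\ell]\leq\begin{cases}\sigma^{-1}&\text{if }\ell=1,\\(\sigma-1)\sigma^{-\ell-1}&\text{if }2\leq\ell\leq\frac{1}{2}n.\end{cases}\tag{5}$$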

In the remaining case of $\tfrac{1}{2}n<\ell<n$ , we observe that if $S[1,\ell]=S[n-\ell+1,n]$ , then $S[n-\ell+1,\ell]$ is also a border of $S$ . This contradicts $F(S)=\ell$ because $|S[n-\ell+1,\ell]|=2\ell-n<\ell$ . Consequently,

$$\operatorname{P}[F(S)=\ell]=0\quad\text{if }\tfrac{1}{2}n<\ell<n.\tag{6}$$
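For small $n$ and $\sigma$, the distribution of the shortest border length $F(S)$ can be enumerated exactly and compared with the three claims above: $\operatorname{P}[F(S)=1]\leq\sigma^{-1}$, $\operatorname{P}[F(S)=\ell]\leq(\sigma-1)\sigma^{-\ell-1}$ for $2\leq\ell\leq\frac{1}{2}n$, and $\operatorname{P}[F(S)=\ell]=0$ for $\frac{1}{2}n<\ell<n$. The following sketch uses exact rational arithmetic; the function names are illustrative and not taken from the paper.

```python
from fractions import Fraction
from itertools import product


def shortest_border(s: tuple) -> int:
    """F(s): the smallest l > 0 with s[:l] == s[n-l:] (l = n always qualifies)."""
    n = len(s)
    return next(l for l in range(1, n + 1) if s[:l] == s[n - l:])


def shortest_border_distribution(n: int, sigma: int) -> dict[int, Fraction]:
    """Exact distribution of F(S) for a uniformly random S of length n >= 1."""
    counts = {l: 0 for l in range(1, n + 1)}
    for s in product(range(sigma), repeat=n):
        counts[shortest_border(s)] += 1
    total = sigma ** n
    return {l: Fraction(c, total) for l, c in counts.items()}


def check_bounds(n: int, sigma: int) -> bool:
    """True if every P[F(S) = l] with l < n respects the stated bounds."""
    for l, p in shortest_border_distribution(n, sigma).items():
        if l == n:
            continue                      # no bound is claimed for F(S) = n
        if l == 1:
            ok = p <= Fraction(1, sigma)
        elif 2 * l <= n:
            ok = p <= Fraction(sigma - 1, sigma ** (l + 1))
        else:
            ok = p == 0                   # n/2 < l < n cannot occur
        if not ok:
            return False
    return True


print(check_bounds(8, 2), check_bounds(6, 3))
```

For $n=4$ and $\sigma=2$, the only strings with $F(S)=2$ are `0101` and `1010`, so the bound $(\sigma-1)\sigma^{-3}=\frac{1}{8}$ is attained with equality there.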

Plugging (3–6) into (2), we obtain

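$$M_{\Delta_{n}}(t)\leq 1+\sigma^{-1}\cdot e^{2t}\cdot M_{\Delta_{n-2}}(t)+\sum_{\ell=2}^{\lfloor n/2\rfloor}(\sigma-1)\sigma^{-\ell-1}\cdot e^{2t\ell}\cdot M_{\Delta_{n-2\ell}}(t).$$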

The inductive assumption further yields

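$$\begin{aligned}M_{\Delta_{n}}(t)&\leq 1+C(t)\left(\sigma^{-1}e^{2t}+(\sigma-1)\sigma^{-3}e^{4t}\cdot\sum_{\ell=0}^{\infty}(\sigma^{-1}e^{2t})^{\ell}\right)\\&=1+C(t)\left(\sigma^{-1}e^{2t}+(\sigma-1)\sigma^{-3}e^{4t}\cdot\frac{1}{1-\sigma^{-1}e^{2t}}\right)\\&=1+C(t)\cdot\frac{\sigma(\sigma-e^{2t})e^{2t}+(\sigma-1)e^{4t}}{\sigma^{2}(\sigma-e^{2t})}\\&=1+\frac{\sigma^{3}-\sigma^{2}e^{2t}}{\sigma^{3}-2\sigma^{2}e^{2t}+e^{4t}}\cdot\frac{\sigma^{2}e^{2t}-e^{4t}}{\sigma^{3}-\sigma^{2}e^{2t}}\\&=\frac{\sigma^{3}-2\sigma^{2}e^{2t}+e^{4t}+\sigma^{2}e^{2t}-e^{4t}}{\sigma^{3}-2\sigma^{2}e^{2t}+e^{4t}}\\&=C(t).\end{aligned}$$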

This completes the proof of Lemma 5. ∎
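Lemma 5 can be sanity-checked on small instances by computing $M_{\Delta_{n}}(t)$ exactly, enumerating all $\sigma^{n}$ strings, and comparing it with $C(t)$. This is a brute-force sketch under the assumption that exhaustive enumeration is affordable (here $\sigma=2$ and $n\leq 8$); the helper names are illustrative.

```python
import math
from itertools import product


def is_unbordered(s: tuple) -> bool:
    """True if s has no proper border."""
    return all(s[:k] != s[len(s) - k:] for k in range(1, len(s)))


def max_unbordered_length(s: tuple) -> int:
    """L(s) by brute force; 0 for the empty string."""
    n = len(s)
    for length in range(n, 0, -1):
        if any(is_unbordered(s[i:i + length]) for i in range(n - length + 1)):
            return length
    return 0


def mgf_gap(n: int, sigma: int, t: float) -> float:
    """Exact M_{Delta_n}(t) = E[exp(t * (n - L(S)))] over all sigma^n strings."""
    total = sum(math.exp(t * (n - max_unbordered_length(s)))
                for s in product(range(sigma), repeat=n))
    return total / sigma ** n


def C(t: float, sigma: int) -> float:
    """The bound C(t) from Lemma 5, written in terms of x = e^{2t}."""
    x = math.exp(2 * t)
    return (sigma ** 3 - sigma ** 2 * x) / (sigma ** 3 - 2 * sigma ** 2 * x + x * x)


sigma = 2
t = 0.1 * math.log(sigma)          # largest t admitted by the lemma
for n in range(0, 9):
    print(n, round(mgf_gap(n, sigma, t), 4), "<=", round(C(t, sigma), 4))
```

The enumeration says nothing about large $n$; it only confirms that the inequality is consistent with the exact values on instances small enough to check.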

Next, let us focus on the expected value $\operatorname{E}[\Delta_{n}]$ . Note that $M_{\Delta_{n}}(t)=\operatorname{E}[e^{t\Delta_{n}}]\geq\operatorname{E}[1+t\Delta_{n}]$ . Consequently, for $0<t\leq 0.1\ln\sigma$ we have

$$\operatorname{E}[\Delta_{n}]\leq\frac{M_{\Delta_{n}}(t)-1}{t}\leq\frac{C(t)-1}{t}.$$

Hence, $\operatorname{E}[\Delta_{n}]$ is bounded by a function of $\sigma$ independent of $n$ . To analyze its asymptotics in terms of $\sigma$ , we plug $t=1$ (valid for $\sigma\geq e^{10}$ ), which yields

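$$\operatorname{E}[\Delta_{n}]\leq C(1)-1=\frac{\sigma^{2}e^{2}-e^{4}}{\sigma^{3}-2\sigma^{2}e^{2}+e^{4}}=\frac{O(\sigma^{2})}{\Omega(\sigma^{3})}=O(\sigma^{-1}).$$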

This completes the proof of Theorem 2(a).

For claim (b), we apply Markov's inequality on top of Lemma 5:

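$$\operatorname{P}[\Delta_{n}\geq\ell]\leq\frac{\operatorname{E}[e^{t\Delta_{n}}]}{e^{t\ell}}=\frac{M_{\Delta_{n}}(t)}{e^{t\ell}}\leq\frac{C(t)}{e^{t\ell}}.$$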

Hence, it suffices to take $\ell\geq 10\log_{\sigma}(\delta^{-1}\cdot C(0.1\ln\sigma))$ to make sure that the probability does not exceed $\delta$ . To complete the proof, observe that

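$$C(0.1\ln\sigma)=\frac{\sigma^{3}-\sigma^{2.2}}{\sigma^{3}-2\sigma^{2.2}+\sigma^{0.4}}=\frac{O(\sigma^{3})}{\Omega(\sigma^{3})}=O(1).$$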
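The threshold used for Theorem 2(b) is explicit enough to evaluate. The sketch below computes $\ell=10\log_{\sigma}(\delta^{-1}\cdot C(0.1\ln\sigma))$ and the Markov-type tail bound $C(t)/e^{t\ell}$ at $t=0.1\ln\sigma$; the function names are illustrative, and the code only evaluates the formulas stated in the proof.

```python
import math


def C(t: float, sigma: int) -> float:
    """The bound C(t) from Lemma 5, written in terms of x = e^{2t}."""
    x = math.exp(2 * t)
    return (sigma ** 3 - sigma ** 2 * x) / (sigma ** 3 - 2 * sigma ** 2 * x + x * x)


def tail_bound(l: float, sigma: int) -> float:
    """Upper bound C(t) / exp(t * l) on P[Delta_n >= l], taken at t = 0.1 ln(sigma)."""
    t = 0.1 * math.log(sigma)
    return C(t, sigma) / math.exp(t * l)


def gap_threshold(delta: float, sigma: int) -> float:
    """The value 10 * log_sigma(C(0.1 ln sigma) / delta), where tail_bound equals delta."""
    t = 0.1 * math.log(sigma)
    return 10 * math.log(C(t, sigma) / delta, sigma)


for sigma in (2, 4, 26):
    for delta in (0.5, 1e-3):
        l = gap_threshold(delta, sigma)
        print(sigma, delta, round(l, 2), tail_bound(l, sigma) <= delta * (1 + 1e-9))
```

Since $e^{t\ell}=\sigma^{0.1\ell}$ at this choice of $t$, the threshold grows as $10\log_{\sigma}\delta^{-1}$ plus a term depending only on $\sigma$.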

3 Average-Case Algorithms for Maximal Unbordered Factors

In this section, we give a pair of reductions between the problems of computing the shortest period and the maximum length of unbordered factors of a uniformly random string, thereby proving Theorem 3. We assume that the alphabet $\Sigma$ is of size $\sigma\geq 2$ . Otherwise, both values are always 1.

We start with a simple argument showing Theorem 3(b). Suppose that we aim at computing $\operatorname{per}(S)$ for a uniformly random string $S\in\Sigma^{n}$. Having determined $L(S)$, we rely on the fact that $\operatorname{per}(S)\geq L(S)$. We construct a string $S_{\$}:=S[1,n-L(S)]\,\$\,S[L(S)+1,n]$, where $\$\notin\Sigma$ is a sentinel symbol, and observe that $S$ has a border of length $\ell\leq n-L(S)$ if and only if $S_{\$}$ has such a border. Moreover, the presence of the sentinel symbol guarantees that $S_{\$}$ does not have proper borders longer than $n-L(S)$. Consequently, we have $|S|-\operatorname{per}(S)=|S_{\$}|-\operatorname{per}(S_{\$})$. The value $\operatorname{per}(S_{\$})$ can be computed using a worst-case algorithm [16, 11], which takes $O(|S_{\$}|)=O(n-L(S)+1)$ time. The expected running time of the reduction is $O(1)$ due to Theorem 2(a).
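The sentinel construction can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's implementation: the classic prefix function stands in for the linear-time period algorithms of [16, 11], a fresh `object()` plays the role of the sentinel $\$$, and `max_unbordered_length` is a brute-force stand-in for whatever procedure supplies $L(S)$.

```python
def longest_proper_border(seq) -> int:
    """Length of the longest proper border, via the classic prefix function."""
    border = [0] * len(seq)
    for i in range(1, len(seq)):
        k = border[i - 1]
        while k > 0 and seq[i] != seq[k]:
            k = border[k - 1]
        if seq[i] == seq[k]:
            k += 1
        border[i] = k
    return border[-1] if border else 0


def period_from_max_unbordered(s: str, max_unbordered_len: int) -> int:
    """per(s), given L(s), by scanning only the 2 * (n - L(s)) + 1 symbols of S_$."""
    n = len(s)
    gap = n - max_unbordered_len
    sentinel = object()                      # compares unequal to every symbol of s
    s_dollar = list(s[:gap]) + [sentinel] + list(s[max_unbordered_len:])
    # |S| - per(S) = |S_$| - per(S_$) = length of the longest proper border of S_$
    return n - longest_proper_border(s_dollar)


def max_unbordered_length(s: str) -> int:
    """Brute-force L(s); a stand-in for the algorithm that supplies L(S)."""
    n = len(s)
    for length in range(n, 0, -1):
        if any(longest_proper_border(s[i:i + length]) == 0
               for i in range(n - length + 1)):
            return length
    return 0


S = "1011001101"
print(period_from_max_unbordered(S, max_unbordered_length(S)))   # 7
```

For the string of Example 1, $L(S)=6$ gives $S_{\$}=\texttt{1011\$1101}$, whose longest proper border `101` has length 3, hence $\operatorname{per}(S)=10-3=7$.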

We proceed with a proof of Theorem 3(a). Suppose that we aim at computing $L(S)$ for a uniformly random string $S\in\Sigma^{n}$ . We apply Theorem 2(b) for $\delta=\frac{1}{n^{2}}$ to obtain a value $d=O(\log_{\sigma}n)$ such that $\operatorname{P}[|T|-L(T)\geq d]\leq\frac{1}{n^{2}}$ for uniformly random strings $T\in\Sigma^{m}$ of arbitrary length $m$ . Note that this also yields $\operatorname{P}[|T|-\operatorname{per}(T)\geq d]\leq\frac{1}{n^{2}}$ due to $\operatorname{per}(T)\geq L(T)$ .

If $n\leq 6d$, we simply determine $L(S)$ using Loptev et al.'s algorithm [15], which takes $O(d)=O(\log_{\sigma}n)$ time on average. Otherwise, we construct three strings

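$$\bar{S}=S[1,3d]\,S[n-3d+1,n],\qquad S^{\prime}=S[d+1,n-d],\qquad\bar{S}^{\prime}=S[d+1,3d]\,S[n-3d+1,n-d],$$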

and we compute $|\bar{S}|-L(\bar{S})$, $|S^{\prime}|-\operatorname{per}(S^{\prime})$, and $|\bar{S}^{\prime}|-\operatorname{per}(\bar{S}^{\prime})$. If any of these values exceeds $d$, we fall back to the algorithm of [15] to compute $L(S)$. Otherwise, we determine $L(S)$ based on $|S|-L(S)=|\bar{S}|-L(\bar{S})$.

Before proving this equality, let us analyze the running time of the reduction. Observe that $\bar{S}$, $S^{\prime}$, and $\bar{S}^{\prime}$ are uniformly random strings of the respective lengths, which lets us use average-case algorithms. In particular, it takes $O(d)$ time on average to compute $L(\bar{S})$ using Loptev et al.'s algorithm [15]. Determining $\operatorname{per}(S^{\prime})$ is the target of the reduction, so we do not include it in the analysis. The value $\operatorname{per}(\bar{S}^{\prime})$ is computed in $O(d)$ worst-case time [16, 11]. The probability of a fall-back is at most $\frac{3}{n^{2}}$ by the choice of $d$, which compensates for the worst-case time $O(n^{2})$ it takes to apply Loptev et al.'s algorithm to the whole of $S$. (Note that we cannot use the average-case bound of $O(n)$ here because the conditional distribution of $S$, in case of a fall-back, is no longer uniform across $\Sigma^{n}$.) Overall, the reduction works in $O(d)=O(\log_{\sigma}n)$ time on average.

It remains to prove $|S|-L(S)=|\bar{S}|-L(\bar{S})$ provided that $|\bar{S}|-L(\bar{S})\leq d$ , $|S^{\prime}|-\operatorname{per}(S^{\prime})\leq d$ , and $|\bar{S}^{\prime}|-\operatorname{per}(\bar{S}^{\prime})\leq d$ . First, consider a maximal unbordered factor of $\bar{S}$ . It must be of the form $S[i,3d]S[n-3d+1,j]$ for some $1\leq i\leq d$ and $n-d+1\leq j\leq n$ , and we claim that $S[i,j]$ is then an unbordered factor of $S$ . For a proof by contradiction, suppose that $S[i,j]$ has a proper border and the longest such border is of length $\ell$ . Note that $\ell>\min(|S[i,3d]|,|S[n-3d+1,j]|)$ because $S[i,3d]S[n-3d+1,j]$ is unbordered. We conclude that $\operatorname{per}(S[i,j])=|S[i,j]|-\ell<n-3d$ . However, this yields $\operatorname{per}(S^{\prime})\leq\operatorname{per}(S[i,j])<n-3d=|S^{\prime}|-d$ , a contradiction. Consequently, $|S|-L(S)\leq|\bar{S}|-L(\bar{S})$ .

The proof of $|S|-L(S)\geq|\bar{S}|-L(\bar{S})$ is symmetric. We consider a maximal unbordered factor $S[i,j]$ of $S$, observe that $1\leq i\leq d$ and $n-d+1\leq j\leq n$ due to $|S|-L(S)\leq d$, and claim that $S[i,3d]S[n-3d+1,j]$ is unbordered. For a proof by contradiction, we suppose that it has a border of length $\ell$. We note that $\ell>\min(|S[i,3d]|,|S[n-3d+1,j]|)$ because $S[i,j]$ is unbordered and derive $\operatorname{per}(\bar{S}^{\prime})\leq\operatorname{per}(S[i,3d]S[n-3d+1,j])<3d$, which contradicts $\operatorname{per}(\bar{S}^{\prime})\geq|\bar{S}^{\prime}|-d=3d$.

This completes the proof of Theorem 3(a).
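The reduction of Theorem 3(a) can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: `max_unbordered_length` is a brute-force stand-in for the algorithm of [15], `shortest_period` (via the prefix function) stands in for the worst-case period algorithms, `period_oracle` is a hypothetical parameter representing the target problem of computing $\operatorname{per}(S^{\prime})$, and the threshold `d` is passed in rather than derived from Theorem 2(b). The three strings are taken to be $\bar{S}=S[1,3d]S[n-3d+1,n]$, $S^{\prime}=S[d+1,n-d]$, and $\bar{S}^{\prime}=S[d+1,3d]S[n-3d+1,n-d]$, the forms the proof works with.

```python
def longest_proper_border(seq) -> int:
    """Length of the longest proper border, via the classic prefix function."""
    border = [0] * len(seq)
    for i in range(1, len(seq)):
        k = border[i - 1]
        while k > 0 and seq[i] != seq[k]:
            k = border[k - 1]
        if seq[i] == seq[k]:
            k += 1
        border[i] = k
    return border[-1] if border else 0


def shortest_period(s: str) -> int:
    """per(s) = |s| - (length of the longest proper border of s)."""
    return len(s) - longest_proper_border(s)


def max_unbordered_length(s: str) -> int:
    """Brute-force L(s); a stand-in for the algorithm of [15]."""
    n = len(s)
    for length in range(n, 0, -1):
        if any(longest_proper_border(s[i:i + length]) == 0
               for i in range(n - length + 1)):
            return length
    return 0


def max_unbordered_via_period(s: str, d: int, period_oracle=shortest_period) -> int:
    """L(s) computed from three short strings plus one period query on S'."""
    n = len(s)
    if n <= 6 * d:
        return max_unbordered_length(s)
    s_bar = s[:3 * d] + s[n - 3 * d:]                 # S[1,3d] S[n-3d+1,n]
    s_prime = s[d:n - d]                              # S[d+1,n-d]
    s_bar_prime = s[d:3 * d] + s[n - 3 * d:n - d]     # S[d+1,3d] S[n-3d+1,n-d]
    gap_bar = len(s_bar) - max_unbordered_length(s_bar)
    gap_prime = len(s_prime) - period_oracle(s_prime)
    gap_bar_prime = len(s_bar_prime) - shortest_period(s_bar_prime)
    if max(gap_bar, gap_prime, gap_bar_prime) > d:
        return max_unbordered_length(s)               # fall-back on the whole string
    return n - gap_bar                                # |S| - L(S) = |S_bar| - L(S_bar)


print(max_unbordered_via_period("0001011", d=1))      # 7
```

Because the fall-back recomputes $L(S)$ directly, the function returns the correct value for every input and every `d`; the choice of `d` only affects how often the cheap path is taken.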

Acknowledgments

Many thanks to Danny Hucke for asking about the possibility of a sublinear average-case algorithm at the presentation of the conference version of this paper, and to the anonymous reviewers for their comments.

Bibliography

[1] Roland Assous and Maurice Pouzet. Une caractérisation des mots periodiques. Discrete Mathematics, 25(1):1–5, 1979. doi:10.1016/0012-365X(79)90146-8.
[2] Christina Boucher and Kathleen Wilkie. Why large closest string instances are easy to solve in practice. In Edgar Chávez and Stefano Lonardi, editors, String Processing and Information Retrieval, SPIRE 2010, volume 6393 of LNCS, pages 106–117. Springer, 2010. doi:10.1007/978-3-642-16321-0_10.
[3] Patrick Hagge Cording and Mathias Bæk Tejs Knudsen. Maximal unbordered factors of random strings. In Shunsuke Inenaga, Kunihiko Sadakane, and Tetsuya Sakai, editors, String Processing and Information Retrieval, SPIRE 2016, volume 9954 of LNCS, pages 93–96, 2016. doi:10.1007/978-3-319-46049-9_9.
[4] Artur Czumaj and Leszek Gąsieniec. On the complexity of determining the period of a string. In Raffaele Giancarlo and David Sankoff, editors, Combinatorial Pattern Matching, CPM 2000, volume 1848 of LNCS, pages 412–422. Springer, 2000. doi:10.1007/3-540-45123-4_34.
[5] Jean-Pierre Duval. Relationship between the period of a finite word and the length of its unbordered segments. Discrete Mathematics, 40(1):31–44, 1982. doi:10.1016/0012-365X(82)90186-8.
[6] Andrzej Ehrenfeucht and D. M. Silberger. Periodicity and unbordered segments of words. Discrete Mathematics, 26(2):101–109, 1979. doi:10.1016/0012-365X(79)90116-X.
[7] Paweł Gawrychowski, Gregory Kucherov, Benjamin Sach, and Tatiana Starikovskaya. Computing the longest unbordered substring. In Costas S. Iliopoulos, Simon J. Puglisi, and Emine Yilmaz, editors, String Processing and Information Retrieval, SPIRE 2015, volume 9309 of LNCS, pages 246–257. Springer, 2015. doi:10.1007/978-3-319-23826-5_24.
[8] Tero Harju and Dirk Nowotka. Periodicity and unbordered words: A proof of the extended Duval conjecture. Journal of the ACM, 54(4):20, 2007. doi:10.1145/1255443.1255448.
