On the Variance of the Length of the Longest Common Subsequences in Random Words With an Omitted Letter

Christian Houdré; Qingqing Liu

arXiv:1812.09552·math.PR·December 27, 2018

TL;DR

This paper analyzes the variance of the longest common subsequence length between two random words, where one contains an extra letter with a certain probability, showing the variance grows linearly with word length.

Contribution

It establishes that the variance of the LCS length is linear in the size of the words in a setting with an omitted letter and probabilistic letter distributions.

Findings

1. Variance of LCS length is linear in n.

2. The presence of an extra letter affects the variance growth.

3. Results extend understanding of LCS behavior in non-uniform random words.

Abstract

We investigate the variance of the length of the longest common subsequences of two independent random words of size $n$, where the letters of one word are i.i.d. uniformly drawn from $\{\alpha_{1}, \alpha_{2}, \dots, \alpha_{m}\}$, while the letters of the other word are i.i.d. drawn from $\{\alpha_{1}, \alpha_{2}, \dots, \alpha_{m}, \alpha_{m + 1}\}$, with probability $p > 0$ to be $\alpha_{m + 1}$, and $(1 - p) / m > 0$ for all the other letters. The order of the variance of this length is shown to be linear in $n$.

Equations

$$\gamma_{m}^{*}=\lim_{n\to\infty}\frac{\mathbb{E}LC_{n}}{n},$$

$$\operatorname{Var}LC_{n}\leq n\Big(1-\sum_{k=1}^{m}p_{k}^{2}\Big).$$

$$\mathbb{P}(X_{1}=\alpha_{1})=\cdots=\mathbb{P}(X_{1}=\alpha_{m})=\frac{1-p}{m}>0,\qquad\mathbb{P}(X_{1}=\alpha_{m+1})=p>0,$$

$$\mathbb{P}(Y_{1}=\alpha_{1})=\cdots=\mathbb{P}(Y_{1}=\alpha_{m})=\frac{1}{m}.$$

$$\operatorname{Var}LC_{n}\leq\frac{n}{2}\Big(2-p^{2}-\frac{1+(1-p)^{2}}{m}\Big),$$

$$\operatorname{Var}S\leq\frac{1}{2}\sum_{i=1}^{n}\mathbb{E}(S-S_{i})^{2},$$

$$\begin{aligned}
&\mathbb{E}\big|LC_{n}-LC_{n}(X_{1}\cdots X_{i-1}\hat{X_{i}}X_{i+1}\cdots X_{n};Y_{1}\cdots Y_{n})\big|^{2}\\
&\quad=\mathbb{E}\Big(\big|LC_{n}-LC_{n}(X_{1}\cdots X_{i-1}\hat{X_{i}}X_{i+1}\cdots X_{n};Y_{1}\cdots Y_{n})\big|^{2}\mathbf{1}_{X_{i}\neq\hat{X_{i}}}\Big)\\
&\quad\leq\mathbb{P}(X_{i}\neq\hat{X_{i}})=1-\sum_{j=1}^{m+1}\big(\mathbb{P}(X_{1}=\alpha_{j})\big)^{2}\\
&\quad=1-m\Big(\frac{1-p}{m}\Big)^{2}-p^{2}\\
&\quad=(1-p)\Big(1-\frac{1}{m}+p\Big(1+\frac{1}{m}\Big)\Big),
\end{aligned}$$

$$\mathbb{E}\big|LC_{n}-LC_{n}(X_{1}\cdots X_{n};Y_{1}\cdots Y_{i-1}\hat{Y_{i}}Y_{i+1}\cdots Y_{n})\big|^{2}\leq\mathbb{P}(Y_{i}\neq\hat{Y_{i}})=1-\frac{1}{m}.$$

$$\operatorname{Var}LC_{n}\leq\frac{n}{2}\Big[(1-p)\Big(1-\frac{1}{m}+p\Big(1+\frac{1}{m}\Big)\Big)+1-\frac{1}{m}\Big]=\frac{n}{2}\Big(2-p^{2}-\frac{1+(1-p)^{2}}{m}\Big).$$

$$\operatorname{Var}LC_{n}\geq Cn.$$

$$Z_{j}^{k+1}=Z_{j}^{k}.$$

$$Z_{j}^{k+1}=U_{k+1}.$$

$$Z_{j}^{k+1}=Z_{j-1}^{k}.$$

$$\bm{Z}^{(k)}\stackrel{\text{d}}{=}(\tilde{\bm{X}}^{(n)}\mid N=n-k),$$

$$\bm{Z}^{(n-N)}\stackrel{\text{d}}{=}\tilde{\bm{X}}^{(n)},$$

$$\bm{Z}^{(k)}\stackrel{\text{d}}{=}(\tilde{\bm{X}}^{(n)}\mid N=n-k),\qquad 2\leq k\leq n-1,$$

$$\mathbb{P}\big((Z_{1}^{k},Z_{2}^{k},\ldots,Z_{k}^{k})=(\alpha_{j_{1}},\alpha_{j_{2}},\ldots,\alpha_{j_{k}})\big)=\Big(\frac{1}{m}\Big)^{k}.$$

$$\begin{aligned}
&\mathbb{P}\big((Z_{1}^{k+1},Z_{2}^{k+1},\ldots,Z_{k+1}^{k+1})=(\alpha_{j_{1}^{\prime}},\alpha_{j_{2}^{\prime}},\ldots,\alpha_{j_{k+1}^{\prime}})\big)\\
&\quad=\sum_{t=2}^{k}\mathbb{P}\big((Z_{1}^{k+1},Z_{2}^{k+1},\ldots,Z_{k+1}^{k+1})=(\alpha_{j_{1}^{\prime}},\alpha_{j_{2}^{\prime}},\ldots,\alpha_{j_{k+1}^{\prime}})\mid T_{k+1}=t\big)\,\mathbb{P}(T_{k+1}=t)\\
&\quad=\sum_{t=2}^{k}\mathbb{P}\big((Z_{1}^{k},\ldots,Z_{t-1}^{k},Z_{t}^{k},\ldots,Z_{k}^{k})=(\alpha_{j_{1}^{\prime}},\ldots,\alpha_{j_{t-1}^{\prime}},\alpha_{j_{t+1}^{\prime}},\ldots,\alpha_{j_{k+1}^{\prime}})\big)\,\mathbb{P}(U_{k+1}=\alpha_{j_{t}^{\prime}})\,\mathbb{P}(T_{k+1}=t)\\
&\quad=\sum_{t=2}^{k}\Big(\frac{1}{m}\Big)^{k}\frac{1}{m}\,\frac{1}{k-1}\\
&\quad=\Big(\frac{1}{m}\Big)^{k+1}.
\end{aligned}$$

$$\bm{Z}^{(k+1)}\stackrel{\text{d}}{=}(\tilde{\bm{X}}^{(n)}\mid N=n-k-1).$$

$$\begin{aligned}
\mathbb{E}e^{i\langle u,\tilde{\bm{X}}^{(n)}\rangle}
&=\sum_{k=0}^{n}\mathbb{E}\big(e^{i\langle u,\bm{Z}^{(n-k)}\rangle}\big)\mathbb{P}(N=k)\\
&=\sum_{k=0}^{n}\mathbb{E}\big(e^{i\langle u,\bm{Z}^{(n-k)}\rangle}\mid N=k\big)\mathbb{P}(N=k)\\
&=\sum_{k=0}^{n}\mathbb{E}\big(e^{i\langle u,\bm{Z}^{(n-N)}\rangle}\mid N=k\big)\mathbb{P}(N=k)\\
&=\mathbb{E}e^{i\langle u,\bm{Z}^{(n-N)}\rangle}.
\end{aligned}$$

$$\bm{Z}^{(n-N)}\stackrel{\text{d}}{=}\tilde{\bm{X}}^{(n)}.$$

$$LC_{n}\stackrel{\text{d}}{=}L_{n}(n-N),$$

$$\operatorname{Var}LC_{n}=\operatorname{Var}(L_{n}(n-N)).$$

Taxonomy

Topics: Algorithms and Data Compression · semigroups and automata theory · Advanced Combinatorial Mathematics

Full text

On the Variance of the Length of the Longest Common Subsequences in Random Words With an Omitted Letter

Christian Houdré, School of Mathematics, Georgia Institute of Technology, 686 Cherry Street, Atlanta, GA 30332-0160 USA, [email protected]. Research supported in part by grants #246283 and #524678 from the Simons Foundation.

Qingqing Liu School of Mathematics, Georgia Institute of Technology, 686 Cherry Street, Atlanta, GA 30332-0160 USA, [email protected]

Abstract

We investigate the variance of the length of the longest common subsequences of two independent random words of size $n$ , where the letters of one word are i.i.d. uniformly drawn from $\{\alpha_{1},\alpha_{2},\cdots,\alpha_{m}\}$ , while the letters of the other word are i.i.d. drawn from $\{\alpha_{1},\alpha_{2},\cdots,\alpha_{m},\alpha_{m+1}\}$ , with probability $p>0$ to be $\alpha_{m+1}$ , and $(1-p)/m>0$ for all the other letters. The order of the variance of this length is shown to be linear in $n$ .

Keywords: Longest common subsequences, variance, lower bound

MSC 2010: 60C05, 60F10, 05A05

1 Introduction and Statement of Results

Let $\bm{X}=(X_{i})_{i\geq 1}$ and $\bm{Y}=(Y_{i})_{i\geq 1}$ be two independent sequences of i.i.d. random variables taking their values in a finite common alphabet $\mathcal{A}$ , with $\operatorname{\mathbb{P}}(X_{1}=\alpha)=p_{x,\alpha}\geq 0$ and $\operatorname{\mathbb{P}}(Y_{1}=\alpha)=p_{y,\alpha}\geq 0$ , $\alpha\in\mathcal{A}$ . Let $LC_{n}$ be the largest $k$ such that there exist $1\leq i_{1}<\cdots<i_{k}\leq n$ and $1\leq j_{1}<\cdots<j_{k}\leq n$ with $X_{i_{s}}=Y_{j_{s}}$ for $s=1,\ldots,k$ , i.e., $LC_{n}$ denotes the length of the longest common subsequences of the random words $\bm{X}^{(n)}:=X_{1}\cdots X_{n}$ and $\bm{Y}^{(n)}:=Y_{1}\cdots Y_{n}$ . The limiting behavior of the expectation of $LC_{n}$ has been extensively studied. In particular, if for all $\alpha\in\mathcal{A}$ , $p_{x,\alpha}=p_{y,\alpha}=1/(\#\mathcal{A})$ , where $\#\mathcal{A}$ denotes the cardinality of $\mathcal{A}$ , the earliest result is due to Chvátal and Sankoff [3], who proved the existence of

$$\gamma_{m}^{*}=\lim_{n\to\infty}\frac{\mathbb{E}LC_{n}}{n},$$

where $m$ denotes the alphabet size, showing also that $0.727273\leq\gamma_{2}^{*}\leq 0.905118$. Much work has since been done to improve these bounds ([6], [4], [7], [5], $\ldots$), and to date the best known bounds seem to be $0.788071\leq\gamma_{2}^{*}\leq 0.826280$, see [15]. These results have also been extended to multiple sequences and to alphabets of size larger than two, e.g., see [11], [14] and the references therein.
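As an illustration of the quantities just introduced, here is a minimal Python sketch that computes the length of a longest common subsequence by the standard dynamic program and uses it for a Monte Carlo estimate of $\mathbb{E}LC_{n}/n$ in the uniform case. The function names `lcs_length` and `estimate_mean_ratio`, the integer encoding of letters, and the sample sizes are assumptions made for this example, not part of the paper.

```python
import random


def lcs_length(x, y):
    """Length of a longest common subsequence of x and y (row-by-row DP)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def estimate_mean_ratio(n, m, trials=200, seed=0):
    """Monte Carlo estimate of E[LC_n]/n for two uniform words over m letters."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        x = [rng.randrange(m) for _ in range(n)]  # letters encoded as 0..m-1
        y = [rng.randrange(m) for _ in range(n)]
        total += lcs_length(x, y)
    return total / (trials * n)
```

For moderate $n$ the estimate for $m=2$ should sit somewhat below the quoted bounds on $\gamma_{2}^{*}$, since the limit is approached from below by superadditivity.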

The study of the variance of $LC_{n}$ is less complete. In case $p_{x,k}=p_{y,k}=p_{k}$ for $k=1,\ldots,m$ , the Efron-Stein inequality implies, as shown in [16], that

$$\operatorname{Var}LC_{n}\leq n\Big(1-\sum_{k=1}^{m}p_{k}^{2}\Big).$$

For lower bounds, linear order results are also proved in various biased instances ([12], [9], [10], [13], [8], [1], [2], $\ldots$ ). For example, [12] and [9] assume that one of the letters has a significantly higher probability of appearing than any of the other letters in the alphabet, while [2] assumes that one of the two sequences is binary while the other is a trinary one. Our paper extends the result of [2] by removing the binary/trinary assumptions and provides precise estimates allowing us to go beyond the uniform case and to also deal with central moments.

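To formally state our problem, let $\mathcal{A}:=\mathcal{A}_{m+1}=\{\alpha_{1},\alpha_{2},\cdots,\alpha_{m},\alpha_{m+1}\}$, and let the letter distribution of $\bm{X}$ be such that

$$\mathbb{P}(X_{1}=\alpha_{1})=\cdots=\mathbb{P}(X_{1}=\alpha_{m})=\frac{1-p}{m}>0,\qquad\mathbb{P}(X_{1}=\alpha_{m+1})=p>0,$$

while the letter distribution of $\bm{Y}$ is such that

$$\mathbb{P}(Y_{1}=\alpha_{1})=\cdots=\mathbb{P}(Y_{1}=\alpha_{m})=\frac{1}{m}.$$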

To start with, an upper bound on the variance of $LC_{n}$ is shown to be

$$\operatorname{Var}LC_{n}\leq\frac{n}{2}\Big(2-p^{2}-\frac{1+(1-p)^{2}}{m}\Big),$$

for all $n\in\mathbb{N}$ . Indeed, the Efron–Stein inequality states that:

$$\operatorname{Var}S\leq\frac{1}{2}\sum_{i=1}^{n}\mathbb{E}(S-S_{i})^{2},\tag{1.1}$$

where $S=S(Z_{1},Z_{2},\cdots,Z_{n})$ and $S_{i}=S(Z_{1},Z_{2},\cdots,Z_{i-1},\hat{Z_{i}},Z_{i+1},\cdots Z_{n})$, and where $(Z_{i})_{1\leq i\leq n}$ and $(\hat{Z_{i}})_{1\leq i\leq n}$ are independent copies of each other.

Now following [16],

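$$\begin{aligned}
&\mathbb{E}\big|LC_{n}-LC_{n}(X_{1}\cdots X_{i-1}\hat{X_{i}}X_{i+1}\cdots X_{n};Y_{1}\cdots Y_{n})\big|^{2}\\
&\quad=\mathbb{E}\Big(\big|LC_{n}-LC_{n}(X_{1}\cdots X_{i-1}\hat{X_{i}}X_{i+1}\cdots X_{n};Y_{1}\cdots Y_{n})\big|^{2}\mathbf{1}_{X_{i}\neq\hat{X_{i}}}\Big)\\
&\quad\leq\mathbb{P}(X_{i}\neq\hat{X_{i}})=1-\sum_{j=1}^{m+1}\big(\mathbb{P}(X_{1}=\alpha_{j})\big)^{2}\\
&\quad=1-m\Big(\frac{1-p}{m}\Big)^{2}-p^{2}\\
&\quad=(1-p)\Big(1-\frac{1}{m}+p\Big(1+\frac{1}{m}\Big)\Big),
\end{aligned}$$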

since when replacing $X_{i}$ by $\hat{X_{i}}$ , $LC_{n}$ changes by at most $1$ and at least $-1$ . Similarly,

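$$\mathbb{E}\big|LC_{n}-LC_{n}(X_{1}\cdots X_{n};Y_{1}\cdots Y_{i-1}\hat{Y_{i}}Y_{i+1}\cdots Y_{n})\big|^{2}\leq\mathbb{P}(Y_{i}\neq\hat{Y_{i}})=1-\frac{1}{m}.$$

Applying (1.1) and combining the two bounds above gives

$$\operatorname{Var}LC_{n}\leq\frac{n}{2}\Big[(1-p)\Big(1-\frac{1}{m}+p\Big(1+\frac{1}{m}\Big)\Big)+1-\frac{1}{m}\Big]=\frac{n}{2}\Big(2-p^{2}-\frac{1+(1-p)^{2}}{m}\Big).$$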
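The arithmetic behind this upper bound can be checked exactly with rational numbers. The sketch below, with hypothetical helper names chosen for this example, evaluates the two replacement probabilities $\mathbb{P}(X_{i}\neq\hat{X_{i}})$ and $\mathbb{P}(Y_{i}\neq\hat{Y_{i}})$ and the resulting Efron–Stein bound; it assumes `p` is passed as a `Fraction` and `n`, `m` as integers.

```python
from fractions import Fraction


def replacement_probability_x(p, m):
    """P(X_i != X_i_hat) = 1 - m*((1-p)/m)^2 - p^2 for the (m+1)-letter word."""
    return 1 - m * ((1 - p) / m) ** 2 - p ** 2


def replacement_probability_y(m):
    """P(Y_i != Y_i_hat) = 1 - 1/m for the uniform m-letter word."""
    return 1 - Fraction(1, m)


def variance_upper_bound(n, p, m):
    """Closed form (n/2) * (2 - p^2 - (1 + (1-p)^2)/m) of the upper bound."""
    return Fraction(n, 2) * (2 - p ** 2 - (1 + (1 - p) ** 2) / m)
```

For instance, with $p=1/5$, $m=3$ and $n=10$, half of $n$ times the sum of the two replacement probabilities coincides with the closed form, both being $106/15$.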

To match the easy bound (1), we can now state the main result of this paper.

Theorem 1.

There exists a constant $C=C(p,m)>0$ independent of $n$ , such that for all $n\geq 1$ ,

$$\operatorname{Var}LC_{n}\geq Cn.\tag{1.3}$$

This theorem, combined with the upper bound (1), gives a linear order, in $n$ , for the variance of $LC_{n}$ , and we refer the reader to Section 4 for an estimate on $C$ .
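To get a feel for the linear order, one can estimate $\operatorname{Var}LC_{n}/n$ by simulation and compare it with the upper bound above. The following is only a rough sketch: the names `sample_lc` and `variance_per_letter`, the encoding of $\alpha_{m+1}$ as the integer `m`, and the sample sizes are assumptions of this example, and a simulation says nothing rigorous about the constant $C$.

```python
import random
import statistics


def lcs_length(x, y):
    """Length of a longest common subsequence of x and y (row-by-row DP)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def sample_lc(n, p, m, rng):
    """One sample of LC_n: X has the extra letter m with probability p, Y is uniform on 0..m-1."""
    x = [m if rng.random() < p else rng.randrange(m) for _ in range(n)]
    y = [rng.randrange(m) for _ in range(n)]
    return lcs_length(x, y)


def variance_per_letter(n, p, m, trials=300, seed=0):
    """Sample variance of LC_n divided by n."""
    rng = random.Random(seed)
    samples = [sample_lc(n, p, m, rng) for _ in range(trials)]
    return statistics.variance(samples) / n
```

Running `variance_per_letter` for a few increasing values of `n` with fixed `p` and `m` should give values that stay bounded above by $\frac{1}{2}\big(2-p^{2}-(1+(1-p)^{2})/m\big)$ up to sampling error.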

2 Proof of Theorem 1

The scheme of the proof elaborates on and extends elements of [2] and [9]. Let $N$ denote the number of letters $\alpha_{m+1}$ in the random word $\bm{X}^{(n)}$. Clearly, $N$ is a binomial random variable with parameters $n$ and $p$. Moreover, let $\tilde{\bm{X}}^{(n)}:=X_{i_{1}}\cdots X_{i_{k}}$, where $1\leq i_{1}<\cdots<i_{k}\leq n$, $X_{j}\neq\alpha_{m+1}$ for all $j\in\{i_{1},\ldots,i_{k}\}$ and $X_{j}=\alpha_{m+1}$ for all $j\in\{1,2,\ldots,n\}\backslash\{i_{1},\ldots,i_{k}\}$. In words, $\tilde{\bm{X}}^{(n)}$ is the subword of $\bm{X}^{(n)}$ made only of non-$\alpha_{m+1}$ letters. To prove our main theorem, we will recursively define a finite random sequence $\bm{Z}^{(1)},\bm{Z}^{(2)},\ldots,\bm{Z}^{(n)}$, where each $\bm{Z}^{(k)}$ has length $k$, by inserting a letter drawn uniformly at random from $\{\alpha_{1},\alpha_{2},\ldots,\alpha_{m}\}$ at a uniformly chosen random location of the previous word $\bm{Z}^{(k-1)}$.

To formally describe the defining mechanism, let $\{U_{k}\}_{1\leq k\leq n}$ and $\{T_{k}\}_{3\leq k\leq n}$ be two independent sequences of random variables, where $\{U_{k}\}_{1\leq k\leq n}$ is a sequence of i.i.d. uniform random variables on $\{\alpha_{1},\alpha_{2},\ldots,\alpha_{m}\}$ , and $\{T_{k}\}_{3\leq k\leq n}$ is a sequence of independent random variables uniform on $\{2,3,\ldots,k-1\}$ , $k\geq 3$ .

Then as in [2], recursively define the sequence $\bm{Z}^{(k)}$ via:

(1) $\bm{Z}^{(1)}=U_{1}$.

(2) $\bm{Z}^{(2)}=U_{1}U_{2}$.

(3) For $k\geq 2$, given $\bm{Z}^{(k)}=Z_{1}^{k}Z_{2}^{k}\cdots Z_{k}^{k}$, let $\bm{Z}^{(k+1)}$ be as follows:

• For all $j<T_{k+1}$, let

$$Z_{j}^{k+1}=Z_{j}^{k}.$$

• For $j=T_{k+1}$, let

$$Z_{j}^{k+1}=U_{k+1}.$$

• For all $j$ such that $T_{k+1}<j\leq k+1$, let

$$Z_{j}^{k+1}=Z_{j-1}^{k}.$$
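The recursive construction above can be sketched in a few lines of Python. This is an illustrative simulation only: the function name `build_insertion_words`, the encoding of $\alpha_{1},\ldots,\alpha_{m}$ as the integers `1..m`, and the use of a seeded generator are assumptions of this example.

```python
import random


def build_insertion_words(n, m, seed=0):
    """Return [Z^(1), ..., Z^(n)], each Z^(k) a list of k letters from 1..m."""
    rng = random.Random(seed)
    u = [rng.randint(1, m) for _ in range(n)]  # U_1, ..., U_n stored as u[0..n-1]
    words = [[u[0]]]                           # Z^(1) = U_1
    if n >= 2:
        words.append([u[0], u[1]])             # Z^(2) = U_1 U_2
    for k in range(2, n):                      # build Z^(k+1) from Z^(k)
        t = rng.randint(2, k)                  # T_{k+1} uniform on {2, ..., k}
        prev = words[-1]
        # letters before position t are kept, U_{k+1} lands at position t,
        # and the remaining letters are shifted one place to the right
        words.append(prev[: t - 1] + [u[k]] + prev[t - 1:])
    return words
```

Since $T_{k+1}$ never takes the values $1$ or $k+1$, the first and last letters of every $\bm{Z}^{(k)}$, $k\geq 2$, remain $U_{1}$ and $U_{2}$, which the sketch makes easy to observe.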

Hence, $\{Z_{i}^{k}\}_{1\leq i\leq k\leq n}$ is a triangular array of uniform random variables with values in $\{\alpha_{1},\alpha_{2},\ldots,\alpha_{m}\}$ , and finding the relation between $\bm{Z}^{(n-N)}$ and $\tilde{\bm{X}}^{(n)}$ is the purpose of our next lemma whose proof is akin to a corresponding proof in [9].

Lemma 1.

For any $n\geq 1$ and $1\leq k\leq n$ ,

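$$\bm{Z}^{(k)}\stackrel{\text{d}}{=}(\tilde{\bm{X}}^{(n)}\mid N=n-k),$$

and moreover,

$$\bm{Z}^{(n-N)}\stackrel{\text{d}}{=}\tilde{\bm{X}}^{(n)},$$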

where $\stackrel{{\scriptstyle\text{d}}}{{=}}$ denotes equality in distribution.

Proof.

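The proof is by induction on $k$. For $k=1$, by definition, $\bm{Z}^{(1)}=U_{1}$, which has the same distribution as $(\tilde{\bm{X}}^{(n)}|N=n-1)$. Next, assume that

$$\bm{Z}^{(k)}\stackrel{\text{d}}{=}(\tilde{\bm{X}}^{(n)}\mid N=n-k),\qquad 2\leq k\leq n-1,$$

and so for any $(\alpha_{j_{1}},\alpha_{j_{2}},\ldots,\alpha_{j_{k}})\in\{\alpha_{1},\alpha_{2},\ldots,\alpha_{m}\}^{k}$,

$$\mathbb{P}\big((Z_{1}^{k},Z_{2}^{k},\ldots,Z_{k}^{k})=(\alpha_{j_{1}},\alpha_{j_{2}},\ldots,\alpha_{j_{k}})\big)=\Big(\frac{1}{m}\Big)^{k}.$$

Then,

$$\begin{aligned}
&\mathbb{P}\big((Z_{1}^{k+1},Z_{2}^{k+1},\ldots,Z_{k+1}^{k+1})=(\alpha_{j_{1}^{\prime}},\alpha_{j_{2}^{\prime}},\ldots,\alpha_{j_{k+1}^{\prime}})\big)\\
&\quad=\sum_{t=2}^{k}\mathbb{P}\big((Z_{1}^{k+1},Z_{2}^{k+1},\ldots,Z_{k+1}^{k+1})=(\alpha_{j_{1}^{\prime}},\alpha_{j_{2}^{\prime}},\ldots,\alpha_{j_{k+1}^{\prime}})\mid T_{k+1}=t\big)\,\mathbb{P}(T_{k+1}=t)\\
&\quad=\sum_{t=2}^{k}\mathbb{P}\big((Z_{1}^{k},\ldots,Z_{t-1}^{k},Z_{t}^{k},\ldots,Z_{k}^{k})=(\alpha_{j_{1}^{\prime}},\ldots,\alpha_{j_{t-1}^{\prime}},\alpha_{j_{t+1}^{\prime}},\ldots,\alpha_{j_{k+1}^{\prime}})\big)\,\mathbb{P}(U_{k+1}=\alpha_{j_{t}^{\prime}})\,\mathbb{P}(T_{k+1}=t)\\
&\quad=\sum_{t=2}^{k}\Big(\frac{1}{m}\Big)^{k}\frac{1}{m}\,\frac{1}{k-1}\\
&\quad=\Big(\frac{1}{m}\Big)^{k+1}.
\end{aligned}$$

Thus,

$$\bm{Z}^{(k+1)}\stackrel{\text{d}}{=}(\tilde{\bm{X}}^{(n)}\mid N=n-k-1).$$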

To prove the second part of the lemma, from the independence of $N$ and $\bm{Z}^{(n-k)}$ , for any $u\in\mathbb{R}^{n-k}$ ,

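$$\begin{aligned}
\mathbb{E}e^{i\langle u,\tilde{\bm{X}}^{(n)}\rangle}
&=\sum_{k=0}^{n}\mathbb{E}\big(e^{i\langle u,\bm{Z}^{(n-k)}\rangle}\big)\mathbb{P}(N=k)\\
&=\sum_{k=0}^{n}\mathbb{E}\big(e^{i\langle u,\bm{Z}^{(n-k)}\rangle}\mid N=k\big)\mathbb{P}(N=k)\\
&=\sum_{k=0}^{n}\mathbb{E}\big(e^{i\langle u,\bm{Z}^{(n-N)}\rangle}\mid N=k\big)\mathbb{P}(N=k)\\
&=\mathbb{E}e^{i\langle u,\bm{Z}^{(n-N)}\rangle}.
\end{aligned}$$

Thus,

$$\bm{Z}^{(n-N)}\stackrel{\text{d}}{=}\tilde{\bm{X}}^{(n)}.$$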

∎

Now let $LC_{n}$ be the length of the longest common subsequences of $\bm{X}^{(n)}$ and $\bm{Y}^{(n)}$ , and let $L_{n}(k)$ be the length of the longest common subsequences/subwords of $\bm{Z}^{(k)}$ and $\bm{Y}^{(n)}$ . It follows from Lemma 1 that,

$$LC_{n}\stackrel{\text{d}}{=}L_{n}(n-N),$$

and therefore,

$$\operatorname{Var}LC_{n}=\operatorname{Var}(L_{n}(n-N)).\tag{2.2}$$
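The reduction to $L_{n}(n-N)$ rests on a simple pathwise observation: since $\bm{Y}^{(n)}$ never contains $\alpha_{m+1}$, deleting every $\alpha_{m+1}$ from $\bm{X}^{(n)}$ cannot change the length of the longest common subsequence. The sketch below checks this on simulated words; the helper names and the encoding of $\alpha_{m+1}$ as the integer `m` are assumptions of this example.

```python
import random


def lcs_length(x, y):
    """Length of a longest common subsequence of x and y (row-by-row DP)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def strip_extra_letter(x, extra):
    """Subword of x made only of letters different from `extra`."""
    return [a for a in x if a != extra]


def check_reduction(n, p, m, seed=0):
    """Return (LCS(X, Y), LCS(X_tilde, Y), N) for one simulated pair of words."""
    rng = random.Random(seed)
    x = [m if rng.random() < p else rng.randrange(m) for _ in range(n)]
    y = [rng.randrange(m) for _ in range(n)]
    x_tilde = strip_extra_letter(x, m)
    return lcs_length(x, y), lcs_length(x_tilde, y), len(x) - len(x_tilde)
```

The first two returned values always agree, and the third is the realized value of $N$, so that the stripped word has length $n-N$.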

In order to prove the main result, we will also need the following result taken from [9].

Lemma 2.

Let $f:D\subset\mathbb{R}\to\mathbb{Z}$ satisfy a local reversed Lipschitz condition, i.e., let $h\geq 0$ and let $f$ be such that for any $i,j\in D$ with $j\geq i+h$ ,

[TABLE]

for some $c>0$ . Let $T$ be a $D$ -valued random variable with $\mathbb{E}|f(T)|^{2}<\infty$ , then

[TABLE]

Next, let

[TABLE]

where $I=[np-\sqrt{np(1-p)},np+\sqrt{np(1-p)}]$ , $K>0$ is a constant which does not depend on $n$ ( $K\leq 1/2m$ will do, see Lemma 10), and where $h(n)$ will also be made precise later. The event $O_{n}$ can be viewed as the event where the map $k\to L_{n}(k)$ locally satisfies a reversed Lipschitz condition.

In Section 3, we will prove

Theorem 2.

For all $n\geq 1$ ,

[TABLE]

where $K$ is given in Lemma 10, $A=\max\{C_{4},C_{5},C_{7}\}$ and $B=\min\{C_{3}\nu,C_{6},C_{8}\}$, with these constants given in (3.5), Lemma 6, and Lemma 8, respectively.

Now with the help of Theorem 2 we can provide the proof of our main result stated in Theorem 1.

Proof of Theorem 1.

By (2.2), it is sufficient to prove the lower bound for $\operatorname{Var}(L_{n}(n-N))$ . First as in [9], with its notation,

[TABLE]

and so, for any $n\geq 1$ ,

[TABLE]

Since $N$ is independent of $(L_{n}(n-k))_{0\leq k\leq n}$ , and from (2.5), for each $\omega\in\Omega$ ,

[TABLE]

where again,

[TABLE]

Again, for each $\omega\in O_{n}$ , from Lemma 2, and since $N$ is independent of $(L_{n}(n-k))_{0\leq k\leq n}$ ,

[TABLE]

Now, (2.6), (2) and (2.8) give

[TABLE]

and it remains to estimate each one of the three terms on the right-hand side of (2.9). By the Berry–Esseen inequality, for all $n\geq 1$,

[TABLE]

Moreover,

[TABLE]

and

[TABLE]

where $F_{n}$ is the distribution function of ${(N-np)}/{\sqrt{np(1-p)}}$, while $\Phi$ is the standard normal one. Likewise,

[TABLE]

Next, using (2) – (2.13),

[TABLE]

Finally, the estimates (2.9)-(2.14) combined with the estimate on $\mathbb{P}(O_{n})$ obtained in Theorem 2 give the lower bound in Theorem 1, whenever $2\ln n/K^{2}\leq h(n)\leq K_{1}\sqrt{n}$ , where the upper bound on $h(n)$ stems from the requirement that the right hand side of (2.9) needs to be lower bounded and where $K_{1}$ is estimated in Section 4.

∎
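The role of the normal approximation in the proof can be illustrated numerically. Assuming that the relevant quantity is the probability that $N$ falls in the interval $I=[np-\sqrt{np(1-p)},np+\sqrt{np(1-p)}]$ (the displayed estimates themselves are not reproduced here), the sketch below computes this binomial probability exactly and compares it with $\Phi(1)-\Phi(-1)$. The function names are hypothetical.

```python
import math


def binomial_pmf(n, p, k):
    """P(N = k) for N ~ Binomial(n, p), computed through logarithms for stability."""
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1 - p)
    )
    return math.exp(log_pmf)


def prob_in_central_window(n, p):
    """P(np - sd <= N <= np + sd) with sd = sqrt(np(1-p))."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    lo = max(math.ceil(mean - sd), 0)
    hi = min(math.floor(mean + sd), n)
    return sum(binomial_pmf(n, p, k) for k in range(lo, hi + 1))


def normal_window_mass():
    """Phi(1) - Phi(-1) for the standard normal distribution."""
    return math.erf(1 / math.sqrt(2))
```

For large $n$ the two numbers are close, which is consistent with a Berry–Esseen error of order $1/\sqrt{n}$.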

3 Proof of Theorem 2

In this section, we prove the aforementioned theorem, therefore completing our proof of Theorem 1. Before doing so, we will need to state a few definitions and set some notations used throughout the rest of the paper:

The sequences $\bm{Z}^{(k)}$ and $\bm{Y}^{(n)}$ are said to have a common subsequence of length $\ell$ if there exist increasing functions $\pi:[1,\ell]\to[1,k]$ and $\eta:[1,\ell]\to[1,n]$ such that

[TABLE]

and $(\pi,\eta)$ is then called a pair of matching subsequences of $\bm{Z}^{(k)}$ and $\bm{Y}^{(n)}$ . Also, throughout, $M^{k}$ denotes the set of pairs of matching subsequences of $\bm{Z}^{(k)}$ and $\bm{Y}^{(n)}$ of maximal length.

Following the approach in [2], the proof of Theorem 2 is then divided into two cases, $k<\nu n$ and $k\geq\nu n$ , where in each case $\nu<1/m$ .

3.1 $k<\nu n$ ( $\nu<1/m$ )

We begin with the simpler case $k<\nu n$ . In this situation, we show that with high probability all the letters of $\bm{Z}^{(k)}$ are matched with letters of $\bm{Y}^{(n)}$ . Let

[TABLE]

Then clearly, $E_{k}^{(n)}\subset E_{k-1}^{(n)}\subset\cdots\subset E_{1}^{(n)}$ , and so

[TABLE]

Lemma 3.

For $\nu<1/m$ , there exists a constant $C_{1}=C_{1}(\nu,m)>0$ such that,

[TABLE]

Proof.

We construct a pair of matching subsequences $(\pi,\eta)$ for $\bm{Z}^{(k)}=Z^{k}_{1}Z^{k}_{2}\cdots Z^{k}_{k}$ and $\bm{Y}$ as follows:

[TABLE]

where we also set $\eta(0)=0$ .

Thus, $\eta(i)$ is the smallest index $\ell$ such that $Z_{1}^{k}\cdots Z_{i}^{k}$ is a subsequence of $Y_{1}Y_{2}\cdots Y_{\ell}$. In this way, $\eta(1),\eta(2),\eta(3),\cdots$ is a renewal process with geometrically distributed holding times, i.e., denoting the interarrival times by

[TABLE]

then $\{T_{i}\}_{i\geq 1}$ is a sequence of independent geometric random variables with parameter $1/m$ , i.e.,

[TABLE]

Thus, $\mathbb{E}T_{i}=m$ . Next,

[TABLE]

and from the independence of the $\{T_{i}\}_{i\geq 1}$ ,

[TABLE]

This last term is minimized at

[TABLE]

thus,

[TABLE]

which is increasing in $\nu$ for $\nu\in(0,1-1/m)$ . Thus,

[TABLE]

Since $\nu<1/m$ , by taking $C_{1}=\ln\left({m(m-1)^{\nu-1}\nu^{\nu}}/{(1-\nu)^{\nu-1}}\right)$ , we have

[TABLE]

∎
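The greedy matching used in this proof is easy to simulate. The sketch below is an illustration under assumed names (`greedy_embedding`, `simulate_eta`): the first function computes the leftmost embedding $\eta$ of a word into another, and the second draws the letters of $\bm{Y}$ on the fly, so that the gaps between successive values of $\eta$ are geometric with parameter $1/m$.

```python
import random


def greedy_embedding(z, y):
    """Leftmost embedding of z into y as 1-based indices, or None if z is not a subsequence."""
    eta = []
    pos = 0
    for letter in z:
        while pos < len(y) and y[pos] != letter:
            pos += 1
        if pos == len(y):
            return None
        pos += 1              # 1-based index of the matched letter of y
        eta.append(pos)
    return eta


def simulate_eta(k, m, seed=0):
    """Simulate eta(1) < ... < eta(k) for uniform letters over m symbols."""
    rng = random.Random(seed)
    eta = []
    pos = 0
    for _ in range(k):
        z_letter = rng.randrange(m)
        while True:           # scan Y until the current letter of Z is found
            pos += 1
            if rng.randrange(m) == z_letter:
                break
        eta.append(pos)
    return eta
```

In simulations $\eta(k)/k$ concentrates around $m$, which is why a word of length $k<\nu n$ with $\nu<1/m$ is, with high probability, entirely matched inside $\bm{Y}^{(n)}$.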

Therefore, Lemma 3 asserts that

[TABLE]

3.2 $k\geq\nu n$ ( $\nu<1/m$ )

To continue, we introduce some more definitions and notations of use throughout the section.

(i) Let $\leq$ denote the partial order between two increasing functions $\pi_{1},\pi_{2}:[1,\ell]\to\mathbb{N}$, i.e., $\pi_{1}\leq\pi_{2}$ if for every $i\in[1,\ell]$, $\pi_{1}(i)\leq\pi_{2}(i)$. Further $(\pi_{1},\eta_{1})\leq(\pi_{2},\eta_{2})$ is short for $\pi_{1}\leq\pi_{2}$ and $\eta_{1}\leq\eta_{2}$.

(ii) Let $M_{min}^{k}\subset M^{k}$ be the set of $(\pi,\eta)\in M^{k}$ which are minimal for the relation $\leq$, i.e., such that for $(\pi_{1},\eta_{1})\in M_{min}^{k}$ and $(\pi_{2},\eta_{2})\in M^{k}$, if $(\pi_{1},\eta_{1})\geq(\pi_{2},\eta_{2})$ then $(\pi_{1},\eta_{1})=(\pi_{2},\eta_{2})$.

(iii) If $(\pi,\eta)$ is a pair of matching subsequences of $\bm{Z}^{(k)}$ and $\bm{Y}^{(n)}$ of length $\ell$, a match of $(\pi,\eta)$ is then defined to be the quadruple

[TABLE]

Moreover, if $\eta(i)+2\leq\eta(i+1)$, the match is said to be non-empty. Therefore, for a non-empty match, there exists $j$ such that $\eta(i)<j<\eta(i+1)$ and $Y_{j}=\alpha$ for some $\alpha\in\mathcal{A}\setminus\{\alpha_{m+1}\}$. In that case, the match is said to contain an $\alpha$, and $Y_{j}$ is called an unmatched letter of the match $\left(\pi(i),\pi(i+1),\eta(i),\eta(i+1)\right)$.

(iv) The sequence $\bm{Y}^{(n)}$ can be uniquely divided into $d$ compartments $[j_{1},j_{2}-1],[j_{2},j_{3}-1],\ldots,[j_{d},n]$, where $1=j_{1}<j_{2}<\cdots<j_{d}\leq n$ are determined by the following recursive relations:

[TABLE]

and $d=\max\{i:\,j_{i}\leq n\}$.

To get a lower bound on the probability that the length of the longest common subsequence increases by one, we recall the construction of $\bm{Z}^{(k)}$ and note that there are $(k-1)$ possible positions for the letter $U_{k+1}$ to be inserted. Therefore, $U_{k+1}$ falls into a non-empty match with probability at least ${(\text{number of nonempty matches of }(\pi,\eta))}/{(k-1)}\geq{(\text{number of nonempty matches of }(\pi,\eta))}/{k}$ . For each non-empty match, there is at least one unmatched letter, and the probability that $U_{k+1}$ takes the same value as the unmatched letter is $1/m$ , resulting in the following lower bound for $(\pi,\eta)\in M^{k}$ :

[TABLE]

Therefore, a good estimate on the number of nonempty matches of $(\pi,\eta)$ will provide a lower bound on the probability that $LC_{n}$ increases by one.
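The counting behind this heuristic can be made concrete. In the sketch below, `eta` is the increasing list of matched positions in $\bm{Y}^{(n)}$, and a match is counted as non-empty when $\eta(i)+2\leq\eta(i+1)$, as in the definition above. The second function evaluates the product of the two factors described in the text, namely the fraction of non-empty matches among $k$ and the probability $1/m$ of drawing a given letter; reading the displayed lower bound this way is an assumption of this example, as are the function names.

```python
def count_nonempty_matches(eta):
    """Number of indices i with eta(i) + 2 <= eta(i+1)."""
    return sum(1 for a, b in zip(eta, eta[1:]) if a + 2 <= b)


def insertion_success_lower_bound(eta, k, m):
    """(number of non-empty matches) / (k * m), as described in the text."""
    return count_nonempty_matches(eta) / (k * m)
```

For example, with matched positions `[1, 2, 4, 5, 9]`, the gaps after positions 2 and 5 each contain at least one unmatched letter, so two matches are non-empty.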

Next we give the main ideas behind the proof that, with high probability, the map $k\to L(k)$ is linearly increasing on $[\nu n,n]$ . We use the letter-insertion scheme, described above, to prove that the random map $k\to L(k)$ typically has positive drift $\lambda$ (which will be determined later in Lemma 9). To do so, let

[TABLE]

and let

[TABLE]

When $F_{k}^{(n)}$ holds, every pair $(\pi,\eta)\in M_{min}^{k}$ has at least $\lambda n$ non-empty matches. Hence the number of non-empty matches divided by $k$ is larger than or equal to $\lambda n/k$. It follows from (3.1) that when $F_{k}^{(n)}$ holds,

[TABLE]

The inequality (3.3) implies that when $F^{(n)}$ holds, the map $k\to L_{n}(k)$ has drift at least $\lambda/m$ for $k\in[\nu n,n]$ . In other words, whenever $F^{(n)}$ holds, with high probability $k\to L_{n}(k)$ has positive slope on $[\nu n,n]$ .

It remains to show that, by concentration, $F^{(n)}$ holds with high probability, and this is proved by contradiction. Indeed if all the matches of $(\pi,\eta)\in M^{k}$ were empty, then the following two conditions would hold:

(1) $(\eta(1),\eta(2),\eta(3),\cdots,\eta(\ell))=(\eta(1),\eta(1)+1,\eta(1)+2,\cdots,\eta(1)+\ell-1)$ where $\ell$ is the length of the LCS of $\bm{Z}^{(k)}$ and $\bm{Y}^{(n)}$, i.e., $\ell=L_{n}(k)$.

(2) The sequence

[TABLE]

would be a subsequence of

[TABLE]

Above, we have two independent sequences of i.i.d. uniform random variables with parameter $1/m$, where one is contained in the other as a subsequence. Thus, the longer one must be approximately at least $m$ times as long as the shorter one, hence $k$ is approximately at least $m$ times $\ell=L_{n}(k)$. As a result, the ratio $L_{n}(k)/k$ would be at most $1/m$, which is very unlikely (Lemma 6), leading to a contradiction.

From the previous arguments, it follows that with high probability any $(\pi,\eta)\in M_{min}^{k}$ contains a non-vanishing proportion $\epsilon>0$ of unmatched letters, hence $(\eta(L_{n}(k))-L_{n}(k))/\eta(L_{n}(k))\geq\epsilon$, where $\eta(L_{n}(k))$ is the index of the last matching letter in $\bm{Y}^{(n)}$ of the match $(\pi,\eta)$. We then show that this proportion $\epsilon$ of unmatched letters generates sufficiently many non-empty matches, i.e., that the unmatched letters should not be concentrated on too small a number of matches.

To prove that there are more than $\lambda n$ nonempty matches, the following two arguments are used:

(1) Any $(\pi,\eta)\in M_{min}^{k}$ is such that every match of $(\pi,\eta)$ contains unmatched letters from at most one compartment of $\bm{Y}^{(n)}$.

(2) There exists a $D>0$, not depending on $n$, such that, with high probability, the total number of integer points contained in the compartments of $\bm{Y}^{(n)}$ of length larger than $D$ is small.

Hence, for $(\pi,\eta)\in M_{min}^{k}$, most of the unmatched letters occur at most $D$ per match, so that a proportion $\epsilon$ of unmatched letters implies a proportion of at least $\epsilon/D$ of non-empty matches.

Let us return to the proof, and let $L_{\ell}(k)$ denote the length of the LCS of $\bm{Z}^{(k)}$ and $\bm{Y}^{(\ell)}=Y_{1}\cdots Y_{\ell}$ . In order for $\bm{Y}^{(\ell)}$ to be contained in $\bm{Z}^{(k)}$ , $k$ needs to be approximately $m$ times as long as $\ell$ , and, then, $L_{\ell}(k)=\ell$ . Therefore, if $k=m\ell(1-\delta)$ , for some $\delta=\delta(\epsilon)>0$ not depending on $\ell$ , then it is extremely unlikely that $\bm{Y}^{(\ell)}$ is a subsequence of $\bm{Z}^{(k)}$ , as shown in the forthcoming lemma.

Lemma 4.

For any $0<\delta<(m-1)/m$ and $\ell\geq 1$ , we have

[TABLE]

where $C_{2}=m/2(m-1)$ .

Proof.

The proof is similar to the proof of Lemma 3 and some of its notation is used.

First let $\tilde{\bm{X}}:=\tilde{\bm{X}}^{(\infty)}$ be the (infinite) subword of $\bm{X}$ with $\alpha_{m+1}$ removed, so that each $\tilde{\bm{X}}^{(n)}$ is a subword of $\tilde{\bm{X}}$. Next, construct a pair of matching subsequences $(\pi,\eta)$ for $\tilde{\bm{X}}$ and $\bm{Y}^{(\ell)}$ as follows:

[TABLE]

Thus, $\pi(i)$ is the smallest index $j$ such that $Y_{1}Y_{2}\cdots Y_{i}$ is a subsequence of $\tilde{X}_{1}\cdots\tilde{X}_{j}$ . In this way, $\pi(1),\pi(2),\pi(3),\cdots$ is a renewal process with geometrically distributed holding time, i.e., denoting the interarrival times as

[TABLE]

then $\{T_{i}\}_{i\geq 1}$ is a sequence of independent geometric random variables with parameter $1/m$ , i.e.,

[TABLE]

Thus, $\mathbb{E}T_{i}=m$ . Then by Lemma 1 and for $0<\delta<1$ , we have

[TABLE]

This last term is minimized at

[TABLE]

thus setting,

[TABLE]

it follows that,

[TABLE]

Now, the Taylor expansion of $\ln w$ with Lagrange remainder gives

[TABLE]

where $0<\xi<\delta$ . Letting $C_{2}=m/2(m-1)$ finishes the proof. ∎
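The statement of Lemma 4 can be explored empirically. The sketch below, with assumed names `is_subsequence` and `containment_frequency`, draws a uniform word of length $\ell$ and an independent uniform word of length $k=\lfloor m\ell(1-\delta)\rfloor$ and records how often the first is a subsequence of the second. It only illustrates the phenomenon and does not reproduce the constant $C_{2}$.

```python
import random


def is_subsequence(short, long):
    """True if `short` can be obtained from `long` by deleting letters."""
    it = iter(long)
    return all(letter in it for letter in short)  # `in` advances the iterator


def containment_frequency(ell, m, delta, trials=200, seed=0):
    """Empirical frequency with which a uniform word of length ell embeds in one of length m*ell*(1-delta)."""
    rng = random.Random(seed)
    k = int(m * ell * (1 - delta))
    hits = 0
    for _ in range(trials):
        y = [rng.randrange(m) for _ in range(ell)]
        z = [rng.randrange(m) for _ in range(k)]
        hits += is_subsequence(y, z)
    return hits / trials
```

Already for $\ell$ of a few hundred and a fixed $\delta>0$, the observed frequency is essentially zero, in line with the exponential decay asserted in the lemma.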

Lemma 4 further entails, as shown next, that for any $0<\epsilon<1$ there exists $\delta(\epsilon)>0$ , small, such that $L_{\ell}(m\ell(1-\delta(\epsilon)))\geq\ell(1-\epsilon)$ is also very unlikely.

Lemma 5.

For any $0<\epsilon<1$ and all $\ell\geq 1$, there exists $\delta(\epsilon)>0$, with $\displaystyle\lim_{\epsilon\to 0}\delta(\epsilon)=0$, such that

[TABLE]

where $G^{(n)}_{\ell}(\epsilon)=\{L_{\ell}(m\ell(1-\delta(\epsilon)))<\ell(1-\epsilon)\}$ , and where $C_{3}:=(\delta(\epsilon)-\epsilon)^{2}C_{2}/2$ . Therefore, letting

[TABLE]

it follows that,

[TABLE]

where $C_{4}={1}/{(1-e^{-C_{3}})}$ .

Proof.

Let $S\subset\{1,2,\cdots,\ell\}$ have cardinality $(1-\epsilon)\ell$ . Clearly, there are $\binom{\ell}{\ell(1-\epsilon)}$ such subsets $S$ . Now fixing the values of $\bm{Y}^{(n)}$ at the indices belonging to $S$ , there are $m^{\epsilon\ell}$ such $\bm{Y}^{(n)}$ agreeing on $S$ . Therefore,

[TABLE]

From (3.4),

[TABLE]

Collecting the above estimates,

[TABLE]

Since

[TABLE]

then

[TABLE]

Therefore, (3.6) becomes

[TABLE]

and it is enough to choose

[TABLE]

to obtain the stated result. ∎

Lemma 6 and Lemma 7, presented next, formalize the argument by contradiction sketched above. To show that it is very unlikely that “the ratio $L_{n}(k)/k$ is at most $1/m$”, note, at first, that for $n\geq 2$,

[TABLE]

Specifically, when $n=2$ , see [3],

[TABLE]

Now, choose $\xi_{m}$ such that

[TABLE]

and let us show that very likely $L_{n}(k)/k$ is larger than $\xi_{m}$ . To do so, let

[TABLE]

and

[TABLE]

Lemma 6.

There exist constants $C_{5},C_{6}>0$ , such that

[TABLE]

Proof.

Divide the sequences $\bm{Z}^{(k)}$ and $\bm{Y}^{(n)}$ into subsequences of length 2, as given in the previous lemma. Then, by superadditivity, $L_{k}(k)\geq\sum_{i=1}^{k/2}\hat{L}_{i}$ , where $\hat{L}_{i}$ is the length of the longest common subsequence between $Y_{2(i-1)+1}Y_{2i}$ and $Z^{k}_{2(i-1)+1}Z^{k}_{2i}$ . Clearly, by the i.i.d. assumptions, $\mathbb{E}(\hat{L}_{i})=\mathbb{E}({L}_{2}(2))$ is constant. Hence for $\tau>0$ ,

[TABLE]

Now let $p(s,\tau):=\mathbb{E}\left(e^{s\left(\hat{L}_{1}-(\mathbb{E}({L}_{2}(2))-\tau)\right)}\right)$ , it is easy to see that $p(s,\tau)$ is smooth in $s$ , and that

[TABLE]

for every $\tau>0$ . Hence,

[TABLE]

for a suitable $c(\tau)>0$ . Thus,

[TABLE]

Now, let $\tau=\tau_{m}:=\mathbb{E}({L}_{2}(2))-2\xi_{m}$ , let $\xi_{m}=11/10m$ , and so

[TABLE]

Since $\inf_{s<0}p(s,\tau_{m})<e^{-1/1000m}$ , one can choose $c(\tau_{m})=1/1000m$ . Hence,

[TABLE]

Choosing $C_{5}=\left.{e^{c(\tau_{m})/2}}\middle/{(e^{c(\tau_{m})/2}-1)}\right.$ , and $C_{6}=c(\tau_{m})(\nu)/2$ , we have,

[TABLE]

∎

We now finish our argument showing that, with high probability, any $(\pi,\eta)\in M_{min}^{k}$ contains a non-vanishing proportion $\epsilon>0$ of unmatched letters. To do so, let

[TABLE]

be the event that any pair of matching subsequences $(\pi,\eta)\in M_{min}^{k}$ has a proportion at least $\epsilon$ of unmatched letters, and let

[TABLE]

Above, $\eta(L_{n}(k))-L_{n}(k)$ is the number of unmatched letters, since $\eta(L_{n}(k))$ is the position of the last matched letter, while $L_{n}(k)$ is the number of matched letters.

Lemma 7.

Let $\epsilon>0$ be small enough such that $\delta(\epsilon)$ , as given in (3.7), satisfies

[TABLE]

where $\xi_{m}$ is as in (3.10). Then, for all $k\geq\nu n$ ,

[TABLE]

and thus

[TABLE]

Proof.

Let $k\in[\nu n,n]$. In order to prove (3.15), we show that if $I^{(n)}_{k}$ does not hold while $G^{(n)}(\epsilon)$ does hold, then $H^{(n)}_{k}$ does not hold either. Let $(\pi,\eta)\in M_{min}^{k}$. If $I^{(n)}_{k}$ does not hold, then the proportion of unmatched letters of $(\pi,\eta)$ is smaller than $\epsilon$, i.e.,

[TABLE]

where $\ell:=\eta(L_{n}(k))$ . (Note that $L_{\ell}(k)=L_{n}(k)$ , since $(\pi,\eta)$ is of maximal length.) Therefore,

[TABLE]

Now, when $G^{(n)}_{\ell}(\epsilon)$ holds, then

[TABLE]

Comparing (3.17) with (3.18) and noting that the (random) map $x\mapsto L_{\ell}(x)$ is increasing yields

[TABLE]

and thus

[TABLE]

Hence, from (3.14),

[TABLE]

which implies that $H^{(n)}_{k}$ cannot hold. ∎

As an example, when $\epsilon\leq e^{-9}/(1+\ln{m})$ ,

[TABLE]

and therefore,

[TABLE]

In order to estimate the event $F^{(n)}$ , we need to show that the unmatched letters of $\bm{Y}^{(n)}$ do not concentrate in a small number of matches of $(\pi,\eta)\in M_{min}^{k}$ . From the minimality of $M_{min}^{k}$ , the unmatched letters of a match of $(\pi,\eta)\in M_{min}^{k}$ contain at most one compartment.

Let $N^{D}$ be the total number of letters in the sequence $\bm{Y}^{(n)}$ contained in a compartment of length at least $D$ , and let,

[TABLE]

where again $\xi_{m}$ is given via (3.10).

Lemma 8.

For any $0<\epsilon<1$, there exist a positive integer $D$ and positive constants $C_{7}$ and $C_{8}$ depending on $D$, such that

[TABLE]

Proof.

Let $\tilde{N}^{D}$ be the number of integers $s\in[0,n-D]$ such that

[TABLE]

It is easy to check that

[TABLE]

Let now $\tilde{Y}_{s}$ , $s\in[0,n-D]$ , be equal to 1 if and only if (3.20) holds, and 0 otherwise. Clearly,

[TABLE]

To estimate the sum (3.22), decompose it into $D$ subsums of i.i.d. random variables $\Sigma_{1},\Sigma_{2},\ldots,\Sigma_{D}$ where

[TABLE]

so that

[TABLE]

Then, from (3.21)

[TABLE]

since in (3.23) at least one of the summands has to be larger than $n{\xi_{m}\epsilon\nu}/2D^{2}$ . Now, the $\tilde{Y}_{s}$ appearing in the subsum $\Sigma_{1}$ are i.i.d. Bernoulli random variables with

[TABLE]

Therefore,

[TABLE]

with $c(\delta)>0$ for $\delta>0$ . Take $\delta=\mathbb{P}(\tilde{Y}_{s}=0)=1-\mathbb{P}(\tilde{Y}_{s}=1)$ , then $c(\delta)=-\ln{\mathbb{P}(\tilde{Y}_{s}=1)}$ . Thus it is enough to choose $D$ such that

[TABLE]

Let $x=(m-1)/m$ and $y=\xi_{m}\nu\epsilon/2m$. We next show that

[TABLE]

does satisfy (3.26), or equivalently that $Dx^{D}<y$ . With the choice in (3.27), $Dx^{D}<y$ is equivalent to $2y\ln x\ln y+2y(\ln x)^{2}<1$ , which is true since

[TABLE]

Choosing $C_{7}=D$ and $C_{8}=c(\delta)/D$ , we have

[TABLE]

∎

We can now find a suitable $\lambda$ such that when $H^{(n)}$ , $I^{(n)}$ and $J^{(n)}$ all hold, then $F^{(n)}$ (which depends on $\lambda$ , see (3.2)) also holds.

Lemma 9.

Let $\epsilon>0$ be as in Lemma 7, let $D$ be such that $2Dm\left(({m-1})/{m}\right)^{D}<{\xi_{m}\nu\epsilon}$ , and let

[TABLE]

Then, for $k\geq\nu n$ ,

[TABLE]

and thus

[TABLE]

Proof.

We prove (3.28), from which (3.29) immediately follows. On $I^{(n)}_{k}$ , each $(\pi,\eta)\in M_{min}^{k}$ has at least $\epsilon\eta(L_{n}(k))$ unmatched letters. But,

[TABLE]

When $H^{(n)}$ holds,

[TABLE]

Since $k\geq\nu n$ , (3.30) and (3.31), together imply that the number of unmatched letters of $(\pi,\eta)\in M_{min}^{k}$ is at least $\epsilon\;\xi_{m}\nu n.$ By $J^{(n)}$ , there are at most $\xi_{m}\nu\epsilon n/2$ letters contained in compartments of length at least $D$ . Thus, there are at least $\xi_{m}\nu\epsilon n/2$ unmatched letters contained in compartments of length less than $D$ . But, every match of $(\pi,\eta)\in M_{min}^{k}$ contains unmatched letters from only one compartment, and as such every match can contain at most $D-1$ unmatched letters from compartments of length less than $D$ . Therefore, these $\epsilon\,\xi_{m}\nu n/2$ unmatched letters which are not in $N^{D}$ , must fill at least $\epsilon\,\xi_{m}\nu n/(2D-2)$ matches of $(\pi,\eta)\in M_{min}^{k}$ . Hence, $(\pi,\eta)\in M_{min}^{k}$ has at least $\epsilon\,\xi_{m}\nu n/(2D-2)$ non-empty matches. ∎

Combining Lemma 7 and Lemma 9 gives,

[TABLE]

which via (3.5), (3.11), and (3.19) entails

[TABLE]

Next, recalling the definition of $O_{n}$ in (2.3), observe that

[TABLE]

The next result estimates the first probability on the right-hand side above and therefore completes the proof of Theorem 2.

Lemma 10.

Let $K\leq 1/(2m)$. Then

[TABLE]

Proof.

Let $\lambda$, given as in Lemma 9, be at most 1, and let $K:=\lambda/(2m)$, so that $K\leq 1/(2m)$. Let

[TABLE]

From (3.3), it follows that:

[TABLE]

where $\sigma_{k}$ denotes the $\sigma$-field generated by the $Z_{i}^{k}$ and $Y_{j}$, namely,

[TABLE]

Moreover, $\Delta(k)$ is equal to zero or one (since $L_{n}(\cdot)$ is non-decreasing on $\mathbb{N}$ ) and is also $\sigma_{k}$ -measurable. Let

[TABLE]

Note that when $F^{(n)}$ holds, then

[TABLE]

for all $k\in[0,n-1]$ . Define

[TABLE]

and

[TABLE]

When $E^{(n)}$ holds, then $L_{n}(k)$ has a slope of one on the domain $[0,\nu n]$ . Therefore, since $K\leq{1}/{2m}$ , the slope condition of $O_{n}$ holds on the domain $[0,\nu n]\cap I$ . When $F^{(n)}$ holds, then $L_{n}(k)$ and $\tilde{L}_{n}(k)$ are equal. Therefore, when $F^{(n)}$ and $\tilde{O}_{n}$ both hold, then the slope condition of $O_{n}$ is verified on the domain $[\nu n,n]\cap I$ . Hence,

[TABLE]

and thus

[TABLE]

It only remains to estimate $\mathbb{P}(\tilde{O}_{n}^{c})$ . First,

[TABLE]

Then, from Hoeffding’s exponential inequality, for any $t>0$ ,

[TABLE]

With the help of (3.32), and since $K=\lambda/2m$ , by choosing $t=\mathbb{E}\Delta(i)-K$ , (3.36) becomes

[TABLE]

for all $i,j\in[\nu n,n]$ . Then, note that there are at most $n$ terms in the sum in (3.35). Thus (3.35) and (3.37) together imply that

[TABLE]

∎
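To give a feel for the size of the exponential bound used in the proof above, the following Python sketch evaluates the classical one-sided Hoeffding bound $\exp(-2Nt^{2})$ for the average of $N$ independent $[0,1]$-valued variables and compares it with a simulated tail frequency. It is a simplified stand-in: the function names, the use of i.i.d. Bernoulli variables, and the sample parameters are assumptions, whereas the increments $\Delta(k)$ in the proof are $\{0,1\}$-valued but are handled through the conditional structure given by $\sigma_{k}$. The exact form of (3.36) is not reproduced here.

```python
import math
import random


def hoeffding_bound(num_terms, t):
    """Classical bound exp(-2 N t^2) on P(mean - E[mean] <= -t), N = num_terms."""
    return math.exp(-2.0 * num_terms * t * t)


def empirical_lower_tail(num_terms, mean, t, trials=2000, seed=0):
    """Simulated frequency of {average of N Bernoulli(mean) <= mean - t}."""
    rng = random.Random(seed)
    threshold = num_terms * (mean - t) + 1e-9  # tolerance for float rounding
    hits = 0
    for _ in range(trials):
        successes = sum(1 for _ in range(num_terms) if rng.random() < mean)
        if successes <= threshold:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    N, mean, t = 100, 0.5, 0.1
    print("bound    :", hoeffding_bound(N, t))
    print("simulated:", empirical_lower_tail(N, mean, t))
```

The bound decays exponentially in $N$, which is what allows a sum of at most $n$ such terms to remain exponentially small.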

4 Estimation of the Constants

To estimate $C$ in (1.3), we need to first estimate various constants.

First, let $\nu=1/(2m)$. Next, to estimate $K_{1}$, the right-hand side of (2.9) needs to be lower bounded. When $n\geq 900/(p(1-p))$, (2.14) gives that

[TABLE]

Therefore, any $K_{1}$ satisfying $0<K_{1}<\sqrt{p(1-p)}/(10\sqrt{10})$ is fine. Choosing $K_{1}=\sqrt{p(1-p)}/(20\sqrt{5})$, we have

[TABLE]

Estimating $A$ and $B$ in (2.4) requires upper bounds on $C_{4}$, $C_{5}$, $C_{7}$ and lower bounds on $C_{3}$, $C_{6}$, $C_{8}$. As shown after Lemma 7, we can choose $\epsilon=e^{-9}/(1+\ln{m})$; then

[TABLE]

and

[TABLE]

Lemma 6 gives

[TABLE]

and

[TABLE]

Lemma 8 gives

[TABLE]

and

[TABLE]

Therefore, one can take $A=\max\{1+2000m,20e^{9}\}$ and $B=e^{-10}/m^{2}$. Then, for $n\geq e^{10}m^{2}\ln{(80e^{9}+8000m)}$, $\mathbb{P}(O_{n})\geq 1/2$.
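The explicit constants quoted in this section are easy to evaluate numerically. The Python sketch below collects only the quantities stated in the text ($\nu$, $K_{1}$, $\epsilon$, $A$, $B$, and the thresholds on $n$); the function name `section4_constants` and the dictionary keys are hypothetical, and the constants whose displayed formulas are not available here ($C_{3}$ through $C_{10}$) are deliberately left out.

```python
import math


def section4_constants(m, p):
    """Evaluate the constants stated in the text for alphabet size m and 0 < p < 1."""
    if m < 2 or not (0.0 < p < 1.0):
        raise ValueError("need m >= 2 and 0 < p < 1")
    q = p * (1.0 - p)
    return {
        "nu": 1.0 / (2 * m),
        "K1": math.sqrt(q) / (20.0 * math.sqrt(5.0)),
        "K1_upper": math.sqrt(q) / (10.0 * math.sqrt(10.0)),
        "epsilon": math.exp(-9) / (1.0 + math.log(m)),
        "A": max(1 + 2000 * m, 20.0 * math.exp(9)),
        "B": math.exp(-10) / m ** 2,
        # Threshold on n used with (2.14) when estimating K_1.
        "n_for_K1": 900.0 / q,
        # Threshold on n beyond which P(O_n) >= 1/2.
        "n_for_On": math.exp(10) * m ** 2 * math.log(80.0 * math.exp(9) + 8000 * m),
    }


if __name__ == "__main__":
    for name, value in section4_constants(m=2, p=0.5).items():
        print(f"{name:>10s} = {value:.6g}")
```

For small $m$ the maximum defining $A$ is attained by $20e^{9}$, and the threshold on $n$ for $\mathbb{P}(O_{n})\geq 1/2$ grows like $m^{2}\ln m$ as $m$ increases.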

Note that when $n\geq 400/(p(1-p))$ , we also have $\mathbb{P}(N\in I)\geq 1/2$ . Let

[TABLE]

and let

[TABLE]

then one can choose $C=\min\{C_{9},C_{10}\}$ in (1.3).

5 Concluding Remarks

• The results of the paper show that we can approach as closely as we want the uniform case and have a linear order on the variance of $LC_{n}$. However, the lower order of the variance in the uniform case is still unknown, although numerical results, see [14], leave little doubt that the variance is linear in the length of the words. (Unfortunately, the estimates of the previous section, on $C=C(p,m)$ in (1.3), converge to zero as $p\to 0$.)

• Combining the above results with techniques and results presented in [9], the upper and lower bound obtained above can be generalized to provide estimates of order $n^{r/2}$, $r\geq 1$, on the centered $r$-th moment of $LC_{n}$.

• Finally, the above results might also be extended to the general case where the letters of one sequence are taken with probability $p_{i}$, $i=1,2,\ldots,m$, where $p_{i}>0$ and $\sum_{i=1}^{m}p_{i}=1$, while for the other sequence the first $m$ letters are taken with probability $p_{i}-r_{i}>0$ and the extra letter is taken with probability $p=\sum_{i=1}^{m}r_{i}$. Then many of the lemmas remain true replacing $1/m$ by $\inf_{i=1,\ldots,m}p_{i}$ or $\inf_{i=1,\ldots,m}(p_{i}-r_{i})/(1-\sum_{k=1}^{m}r_{k})$. For example, in the heading of Section 3.1 and Section 3.2, in (3.1), (3.3), (3.8), and Lemma 10, the $1/m$ can be replaced by $\inf_{i=1,\ldots,m}p_{i}$. In (3.4) of Lemma 4, and in the definition of $G^{(n)}_{\ell}(\epsilon)$ in Lemma 5, the term $L_{\ell}(m\ell(1-\delta))$ would have to be replaced with

[TABLE]

However, some constants that need delicate estimation, such as $\xi_{m}$, could be a topic for further research.
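The setting of the paper is simple to simulate, which may help in exploring numerically how the variance of $LC_{n}$ behaves for small $p$. The following Python sketch computes the LCS length by the standard dynamic program and estimates the mean and variance of $LC_{n}$ by Monte Carlo, with one word uniform on $m$ letters and the other containing an extra letter with probability $p$. The function names, the integer encoding of the letters, and the sample parameters are assumptions for illustration; this is not the numerical study of [14], and such estimates do not by themselves establish the order of the variance.

```python
import random
import statistics


def lcs_length(x, y):
    """Length of a longest common subsequence, via the standard O(|x||y|) DP."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def sample_words(n, m, p, rng):
    """X: extra letter (coded m) w.p. p, else uniform on 0..m-1; Y: uniform on 0..m-1."""
    x = [m if rng.random() < p else rng.randrange(m) for _ in range(n)]
    y = [rng.randrange(m) for _ in range(n)]
    return x, y


def estimate_mean_variance(n, m, p, trials, seed=0):
    """Monte Carlo estimates of E[LC_n] and Var[LC_n] from independent samples."""
    rng = random.Random(seed)
    values = [lcs_length(*sample_words(n, m, p, rng)) for _ in range(trials)]
    return statistics.mean(values), statistics.variance(values)


if __name__ == "__main__":
    for n in (50, 100, 200):
        mean, var = estimate_mean_variance(n, m=2, p=0.2, trials=200)
        print(n, round(mean / n, 3), round(var / n, 3))
```

Printing the ratio of the estimated variance to $n$ for increasing $n$ gives a rough visual check of linear growth, though the pure-Python dynamic program limits this to moderate word lengths.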

Bibliography

[1] Saba Amsalu, Christian Houdré, and Heinrich Matzinger. Sparse Long Blocks and the Variance of the Longest Common Subsequences in Random Words. arXiv:1204.1009v2 [math-ph], September 2016.
[2] Federico Bonetto and Heinrich Matzinger. Fluctuations of the Longest Common Subsequence in the Asymmetric Case of 2- and 3-Letter Alphabets. Latin American Journal of Probability and Mathematical Statistics, 2:195–216, 2006.
[3] Václav Chvátal and David Sankoff. Longest Common Subsequences of Two Random Sequences. Journal of Applied Probability, 12(2):306–315, 1975.
[4] Václav Chvátal and David Sankoff. An Upper-bound Technique for Lengths of Common Subsequences. In David Sankoff and Joseph Kruskal, editors, Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, Massachusetts, 1983.
[5] Vladimír Dančík. Expected Length of Longest Common Subsequences. PhD thesis, 1994.
[6] Joseph G. Deken. Some Limit Results for Longest Common Subsequences. Discrete Mathematics, 26(1):17–31, January 1979.
[7] Joseph G. Deken. Probabilistic Behavior of Longest-Common-Subsequence Length. In David Sankoff and Joseph Kruskal, editors, Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, Massachusetts, 1983.
[8] Ruoting Gong, Christian Houdré, and Jüri Lember. Lower Bounds on the Generalized Central Moments of the Optimal Alignments Score of Random Sequences. Journal of Theoretical Probability, pages 1–41, December 2016.
