On the Variance of the Length of the Longest Common Subsequences in Random Words With an Omitted Letter
Christian Houdré, Qingqing Liu

TL;DR
This paper analyzes the variance of the longest common subsequence length between two random words, where one contains an extra letter with a certain probability, showing the variance grows linearly with word length.
Contribution
It establishes that the variance of the LCS length is linear in the size of the words in a setting with an omitted letter and probabilistic letter distributions.
Findings
Variance of LCS length is linear in n.
The presence of an extra letter affects the variance growth.
Results extend understanding of LCS behavior in non-uniform random words.
Abstract
We investigate the variance of the length of the longest common subsequences of two independent random words of size $n$, where the letters of one word are i.i.d. uniformly drawn from $\{\alpha_1, \alpha_2, \dots, \alpha_m\}$, while the letters of the other word are i.i.d. drawn from $\{\alpha_1, \alpha_2, \dots, \alpha_m, \alpha_{m+1}\}$, with probability $p > 0$ to be $\alpha_{m+1}$, and $(1-p)/m > 0$ for all the other letters. The order of the variance of this length is shown to be linear in $n$.
Christian Houdré, School of Mathematics, Georgia Institute of Technology, 686 Cherry Street, Atlanta, GA 30332-0160 USA. Research supported in part by the grant and from the Simons Foundation.
Qingqing Liu, School of Mathematics, Georgia Institute of Technology, 686 Cherry Street, Atlanta, GA 30332-0160 USA.
Keywords: Longest common subsequences, variance, lower bound
MSC 2010: 60C05, 60F10, 05A05
1 Introduction and Statement of Results
Let and be two independent sequences of i.i.d. random variables taking their values in a finite common alphabet , with and , . Let be the largest such that there exist and with for , i.e., denotes the length of the longest common subsequences of the random words and . The limiting behavior of the expectation of has been extensively studied. In particular, if for all , , where denotes the cardinality of , the earliest result is due to Chvátal and Sankoff [3], who proved the existence of
[TABLE]
where denotes the alphabet size, showing also that . Much work has since been done to improve these bounds ([6], [4], [7], [5]), and to date the best known bounds seem to be , see [15]. These results have also been extended to multiple sequences and alphabets of size larger than two, e.g., see [11], [14] and the references therein.
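As a concrete companion to the definition above, the following is a minimal sketch of the standard dynamic program for the length of a longest common subsequence. The function name `lcs_length` is an illustrative choice and does not come from the paper.

```python
def lcs_length(x, y):
    """Length of a longest common subsequence of the sequences x and y."""
    # prev[j] is the LCS length of the already processed prefix of x and y[:j].
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                # Extend a common subsequence of the two shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Drop the last letter of one of the two prefixes.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

The routine runs in time proportional to the product of the two word lengths and keeps only two rows of the table, which is enough for simulations at moderate word sizes.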
The study of the variance of is less complete. In case for , the Efron-Stein inequality implies, as shown in [16], that
[TABLE]
.
For lower bounds, linear order results are also proved in various biased instances ([12], [9], [10], [13], [8], [1], [2]). For example, [12] and [9] assume that one of the letters has a significantly higher probability of appearing than any of the other letters in the alphabet, while [2] assumes that one of the two sequences is binary while the other is a ternary one. Our paper extends the result of [2] by removing the binary/ternary assumptions and provides precise estimates allowing us to go beyond the uniform case and to also deal with central moments.
To formally state our problem, let , and let the letter distribution of be such that
[TABLE]
while the letter distribution of is such that
[TABLE]
To start with, an upper bound on the variance of is shown to be
[TABLE]
for all . Indeed, the Efron–Stein inequality states that:
[TABLE]
where, and , and where and are independent copies of each other.
Now following [16],
[TABLE]
since when replacing by , changes by at most and at least . Similarly,
[TABLE]
Applying (1.1) and combining the two bounds above gives
[TABLE]
To match the easy bound (1), we can now state the main result of this paper.
Theorem 1.
There exists a constant independent of , such that for all ,
[TABLE]
This theorem, combined with the upper bound (1), gives a linear order, in , for the variance of , and we refer the reader to Section 4 for an estimate on .
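To get a numerical feel for the linear order of the variance, here is a small Monte Carlo sketch. It assumes the model of the abstract: one word is uniform on an alphabet of `m` letters, and the other word takes an extra letter with probability `p` and is otherwise uniform on the same `m` letters. The names `n`, `m`, `p`, `sample_words` and `variance_ratio` are illustrative assumptions, and the simulation proves nothing; it only estimates the ratio of the variance to the word length.

```python
import random
import statistics


def lcs_length(x, y):
    """Length of a longest common subsequence of x and y (row-by-row DP)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if a == b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def sample_words(n, m, p, rng):
    """Draw the two words; letters are 0..m-1 and the extra letter is m."""
    x = [rng.randrange(m) for _ in range(n)]
    y = [m if rng.random() < p else rng.randrange(m) for _ in range(n)]
    return x, y


def variance_ratio(n, m, p, trials, seed=0):
    """Empirical variance of the LCS length divided by the word length n."""
    rng = random.Random(seed)
    samples = [lcs_length(*sample_words(n, m, p, rng)) for _ in range(trials)]
    return statistics.variance(samples) / n
```

Calling `variance_ratio` for a few increasing values of `n` with `m` and `p` fixed gives a rough picture of whether the ratio stabilizes; the number of trials needed for a stable estimate grows with `n`.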
2 Proof of Theorem 1
The scheme of the proof elaborates and extends elements of [2] and [9]. So, let denote the number of letters in the random word . Clearly, is a binomial random variable with parameters and . Moreover, let , where , for all and for all . In words, is the subword of made only of non- letters. To prove our main theorem, we will recursively define a finite random sequence , where each has length , by inserting uniformly at random and at a uniform random location a letter from to the previous .
To formally describe the defining mechanism, let and be two independent sequences of random variables, where is a sequence of i.i.d. uniform random variables on , and is a sequence of independent random variables uniform on , .
Then as in [2], recursively define the sequence via:
- (1) .
- (2) .
- (3) For , given , let be as follows:
  - For all , let
  [TABLE]
  - For , let
  [TABLE]
  - For all such that , let
  [TABLE]
Hence, is a triangular array of uniform random variables with values in , and finding the relation between and is the purpose of our next lemma whose proof is akin to a corresponding proof in [9].
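The recursive construction can be sketched in a few lines. The sketch below assumes that the chain starts from a single uniform letter and that each step inserts one uniform letter into one of the available gaps, chosen uniformly. The starting point and the names `insert_uniform_letter` and `insertion_chain` are assumptions made for illustration, since the exact initial conditions are not recoverable from the text above.

```python
import random


def insert_uniform_letter(word, alphabet, rng):
    """Insert a uniformly chosen letter at a uniformly chosen gap of word."""
    pos = rng.randrange(len(word) + 1)  # a word of length k has k + 1 gaps
    letter = rng.choice(alphabet)
    return word[:pos] + [letter] + word[pos:]


def insertion_chain(k_max, alphabet, rng):
    """Return the words of lengths 1, 2, ..., k_max built by successive insertions."""
    words = [[rng.choice(alphabet)]]  # assumed starting point: one uniform letter
    while len(words[-1]) < k_max:
        words.append(insert_uniform_letter(words[-1], alphabet, rng))
    return words
```

Each word in the chain contains the previous one as a subword obtained by deleting a single letter, which is what makes the LCS length with a fixed second word non-decreasing along the chain.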
Lemma 1.
For any and ,
[TABLE]
and moreover,
[TABLE]
where denotes equality in distribution.
Proof.
The proof is by induction on . Let , by definition, , which has the same distribution as . Next, assume that
[TABLE]
and so for any ,
[TABLE]
Then,
[TABLE]
Thus,
[TABLE]
To prove the second part of the lemma, from the independence of and , for any ,
[TABLE]
Thus,
[TABLE]
∎
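The inductive step can be checked exactly on small cases. The sketch below assumes that the first claim of Lemma 1 is that a uniform word stays uniform after one uniform insertion at a uniform location. It enumerates every word of length `k` over `m` letters, every gap and every inserted letter, and accumulates the exact law of the resulting word with rational arithmetic. The function name is illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def distribution_after_insertion(k, m):
    """Exact law of a uniform word of length k after one uniform insertion."""
    dist = Counter()
    # Each (word, gap, letter) triple has the same probability.
    weight = Fraction(1, m**k * (k + 1) * m)
    for word in product(range(m), repeat=k):
        for pos in range(k + 1):
            for letter in range(m):
                new_word = word[:pos] + (letter,) + word[pos:]
                dist[new_word] += weight
    return dist
```

Every word of length `k + 1` arises from exactly `k + 1` triples (one per deleted position), so each receives mass `1 / m**(k + 1)`, in agreement with the uniform law.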
Now let be the length of the longest common subsequences of and , and let be the length of the longest common subsequences/subwords of and . It follows from Lemma 1 that,
[TABLE]
and therefore,
[TABLE]
In order to prove the main result, we will also need the following result taken from [9].
Lemma 2.
Let satisfy a local reversed Lipschitz condition, i.e., let and let be such that for any with ,
[TABLE]
for some . Let be a -valued random variable with , then
[TABLE]
Next, let
[TABLE]
where , is a constant which does not depend on ( will do, see Lemma 10), and where will also be made precise later. The event can be viewed as the event where the map locally satisfies a reversed Lipschitz condition.
In Section 3, we will prove
Theorem 2.
For all ,
[TABLE]
where, is given in Lemma 10, , and , and these constants are given in (3.5), Lemma 6, and Lemma 8 respectively.
Now with the help of Theorem 2 we can provide the proof of our main result stated in Theorem 1.
Proof of Theorem 1.
By (2.2), it is sufficient to prove the lower bound for . First as in [9], with its notation,
[TABLE]
and so, for any ,
[TABLE]
Since is independent of , and from (2.5), for each ,
[TABLE]
where again,
[TABLE]
Again, for each , from Lemma 2, and since is independent of ,
[TABLE]
Now, (2.6), (2) and (2.8) give
[TABLE]
and it remains to estimate each one of the three terms on the right hand side of (2.9). By the Berry-Esséen inequality, for all ,
[TABLE]
Moreover,
[TABLE]
and
[TABLE]
where is the distribution function of , while is the standard normal one. Likewise,
[TABLE]
[TABLE]
Finally, the estimates (2.9)-(2.14) combined with the estimate on obtained in Theorem 2 give the lower bound in Theorem 1, whenever , where the upper bound on stems from the requirement that the right hand side of (2.9) needs to be lower bounded and where is estimated in Section 4.
∎
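The Berry-Esséen step can be illustrated numerically for a binomial random variable, which is the law of the number of extra letters. The sketch below computes the exact Kolmogorov distance between a standardized binomial law and the standard normal law, and compares it with the classical bound. The constant `0.56` is an assumption taken as an admissible published value of the absolute constant; it is not the constant used in the paper, and the function names are illustrative.

```python
import math


def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def kolmogorov_distance_binomial(n, p):
    """sup_x |P(B <= x) - Phi((x - np) / sd)| for B binomial with parameters n, p."""
    mean = n * p
    sd = math.sqrt(n * p * (1.0 - p))
    dist = 0.0
    cdf_left = 0.0  # value of the binomial CDF just before the jump at k
    for k in range(n + 1):
        phi = normal_cdf((k - mean) / sd)
        cdf_right = cdf_left + math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
        # The supremum is attained at a jump, from the left or from the right.
        dist = max(dist, abs(cdf_left - phi), abs(cdf_right - phi))
        cdf_left = cdf_right
    return dist


def berry_esseen_bound(n, p, constant=0.56):
    """constant * E|xi - p|^3 / (sigma^3 * sqrt(n)) for Bernoulli(p) summands."""
    third_abs_moment = p * (1.0 - p) * ((1.0 - p) ** 2 + p**2)
    sigma_cubed = (p * (1.0 - p)) ** 1.5
    return constant * third_abs_moment / (sigma_cubed * math.sqrt(n))
```

Both quantities decay like the inverse square root of `n`, which is the rate that enters the estimates in the proof.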
3 Proof of Theorem 2
In this section, we prove the aforementioned theorem, therefore completing our proof of Theorem 1. Before doing so, we will need to state a few definitions and set some notations used throughout the rest of the paper:
The sequences and are said to have a common subsequence of length if there exist increasing functions and such that
[TABLE]
and is then called a pair of matching subsequences of and . Also, throughout, denotes the set of pairs of matching subsequences of and of maximal length.
Following the approach in [2], the proof of Theorem 2 is then divided into two cases, and , where in each case .
3.1 ()
We begin with the simpler case . In this situation, we show that with high probability all the letters of are matched with letters of . Let
[TABLE]
Then clearly, , and so
[TABLE]
Lemma 3.
For , there exists a constant such that,
[TABLE]
Proof.
We construct a pair of matching subsequences for and as follows,
[TABLE]
where we also set .
Thus, is the smallest index such that is a subsequence of . In this way, is a renewal process with geometrically distributed holding times, i.e., denoting the interarrival times as
[TABLE]
then is a sequence of independent geometric random variables with parameter , i.e.,
[TABLE]
Thus, . Next,
[TABLE]
and from the independence of the ,
[TABLE]
This last term is minimized at
[TABLE]
thus,
[TABLE]
which is increasing in for . Thus,
[TABLE]
Since , by taking , we have
[TABLE]
∎
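The greedy matching used in this proof is easy to simulate. The sketch below embeds a pattern into a text letter by letter, recording the first index at which each prefix of the pattern becomes a subsequence of the text. It assumes the text is i.i.d. uniform on `m` letters, so that the gaps between successive match times are geometric with success probability `1/m` and mean `m`. All names are illustrative.

```python
import random


def greedy_match_times(pattern, text):
    """1-based indices T_1 < T_2 < ... with T_k the smallest index such that
    pattern[:k] is a subsequence of text[:T_k]."""
    times = []
    k = 0
    for idx, letter in enumerate(text, start=1):
        if k < len(pattern) and letter == pattern[k]:
            times.append(idx)
            k += 1
    return times


def mean_interarrival(m, n_pattern, n_text, seed=0):
    """Number of matched letters and the empirical mean gap between match times."""
    rng = random.Random(seed)
    pattern = [rng.randrange(m) for _ in range(n_pattern)]
    text = [rng.randrange(m) for _ in range(n_text)]
    times = greedy_match_times(pattern, text)
    gaps = [t - s for s, t in zip([0] + times, times)]
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return len(times), mean_gap
```

When the text is sufficiently longer than `m` times the pattern length, the whole pattern is matched with overwhelming probability, which is the phenomenon the lemma quantifies.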
Therefore, Lemma 3 asserts that
[TABLE]
3.2 ()
To continue, we introduce some more definitions and notations of use throughout the section.
- (i) Let denote the partial order between two increasing functions , i.e., if for every , . Further is short for and .
- (ii) Let be the set of which are minimal for the relation , i.e., such that for and , if then .
- (iii) If is a pair of matching subsequences of and of length , a match of is then defined to be the quadruple
  [TABLE]
  Moreover, if , the match is said to be non-empty. Therefore, for a non-empty match, there exists , such that and for some . In that case, the match is said to contain an , and is called an unmatched letter of the match .
- (iv) The sequence can be uniquely divided into compartments , where are determined by the following recursive relations:
  [TABLE]
  and .
To get a lower bound on the probability that the length of the longest common subsequence increases by one, we recall the construction of and note that there are possible positions for the letter to be inserted. Therefore, falls into a non-empty match with probability at least . For each non-empty match, there is at least one unmatched letter, and the probability that takes the same value as the unmatched letter is , resulting in the following lower bound for :
[TABLE]
Therefore, a good estimate on the number of nonempty matches of will provide a lower bound on the probability that increases by one.
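The effect of a single insertion on the LCS length can be observed empirically. The sketch below draws a fixed word `x` and a word `z`, both uniform on `m` letters, inserts one uniform letter into `z` at a uniform gap, and records the change in LCS length. The change is always 0 or 1, since inserting a letter cannot destroy a common subsequence and adds at most one letter to it. The word lengths and all names are illustrative assumptions.

```python
import random


def lcs_length(x, y):
    """Length of a longest common subsequence of x and y (row-by-row DP)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if a == b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def insertion_deltas(n, k, m, trials, seed=0):
    """LCS increments caused by one uniform insertion into a word of length k."""
    rng = random.Random(seed)
    deltas = []
    for _ in range(trials):
        x = [rng.randrange(m) for _ in range(n)]
        z = [rng.randrange(m) for _ in range(k)]
        pos = rng.randrange(k + 1)
        z_new = z[:pos] + [rng.randrange(m)] + z[pos:]
        deltas.append(lcs_length(x, z_new) - lcs_length(x, z))
    return deltas
```

The average of the returned list is an empirical estimate of the probability that the LCS length increases by one, the quantity that the lower bound above controls.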
Next we give the main ideas behind the proof that, with high probability, the map is linearly increasing on . We use the letter-insertion scheme, described above, to prove that the random map typically has positive drift (which will be determined later in Lemma 9). To do so, let
[TABLE]
and let
[TABLE]
When holds, every pair of has at least nonempty matches. Hence the number of non-empty matches divided by is larger than or equal to . It follows from (3.1) that when holds,
[TABLE]
The inequality (3.3) implies that when holds, the map has drift at least for . In other words, whenever holds, with high probability has positive slope on .
It remains to show that, by concentration, holds with high probability, and this is proved by contradiction. Indeed if all the matches of were empty, then the following two conditions would hold:
- (1) where is the length of the LCS of and , i.e., .
- (2) The sequence
  [TABLE]
  would be a subsequence of
  [TABLE]
Above, we have two independent sequences of i.i.d. uniform random variables with parameter , where one is contained in the other as a subsequence. Thus, the longer one must approximately be at least times as long as the shorter one, hence is approximately at least times as long as . As a result, the ratio is to be at most , which is very unlikely (Lemma 6), leading to a contradiction.
From the previous arguments, it follows that with high probability any contains a non-vanishing proportion of unmatched letters, hence , where is the index of the last matching letter in of the match . We then show that this proportion of unmatched letters generates sufficiently many non-empty matches, i.e., that the unmatched letters should not be concentrated on a too small number of matches.
To prove that there are more than nonempty matches, the following two arguments are used:
- (1) Any is such that every match of contains unmatched letters from at most one compartment of .
- (2) There exists a , not depending on , such that, with high probability, the total number of integer points contained in the compartments of of length larger than , is small.
Henceforth, for the majority of unmatched letters are at most per match, ensuring that a proportion of unmatched letters implies a proportion of at least non-empty matches.
Let us return to the proof, and let denote the length of the LCS of and . In order for to be contained in , needs to be approximately times as long as , and, then, . Therefore, if , for some not depending on , then it is extremely unlikely that is a subsequence of , as shown in the forthcoming lemma.
Lemma 4.
For any and , we have
[TABLE]
where .
Proof.
The proof is similar to the proof of Lemma 3 and some of its notation is used.
First let , be the (infinite) subword of with removed, and therefore each is a subword of . Next, construct a pair of matching subsequences for and as follows:
[TABLE]
Thus, is the smallest index such that is a subsequence of . In this way, is a renewal process with geometrically distributed holding time, i.e., denoting the interarrival times as
[TABLE]
then is a sequence of independent geometric random variables with parameter , i.e.,
[TABLE]
Thus, . Then by Lemma 1 and for , we have
[TABLE]
This last term is minimized at
[TABLE]
thus setting,
[TABLE]
it follows that,
[TABLE]
Now, the Taylor expansion of with Lagrange remainder gives
[TABLE]
where . Letting finishes the proof. ∎
Lemma 4 further entails, as shown next, that for any there exists , small, such that is also very unlikely.
Lemma 5.
For any and all , there exists , with , such that
[TABLE]
where , and where . Therefore, letting
[TABLE]
it follows that,
[TABLE]
where .
Proof.
Let have cardinality . Clearly, there are such subsets . Now fixing the values of at the indices belonging to , there are such agreeing on . Therefore,
[TABLE]
From (3.4),
[TABLE]
Collecting the above estimates,
[TABLE]
Since
[TABLE]
then
[TABLE]
Therefore, (3.6) becomes
[TABLE]
and it is enough to choose
[TABLE]
to obtain the stated result. ∎
Lemma 6 and Lemma 7, presented next, formalize the argument by contradiction outlined above. To show that it is very unlikely that “the ratio is at most ”, note, at first, that for ,
[TABLE]
Specifically, when , see [3],
[TABLE]
Now, choose such that
[TABLE]
and let us show that very likely is larger than . To do so, let
[TABLE]
and
[TABLE]
Lemma 6.
There exist constants , such that
[TABLE]
Proof.
Divide the sequences and into subsequences of length 2, as given in the previous lemma. Then, by superadditivity, , where is the length of the longest common subsequence between and . Clearly, by the i.i.d. assumptions, is constant. Hence for ,
[TABLE]
Now let , it is easy to see that is smooth in , and that
[TABLE]
for every . Hence,
[TABLE]
for a suitable . Thus,
[TABLE]
Now, let , let , and so
[TABLE]
Since , one can choose . Hence,
[TABLE]
Choosing , and , we have,
[TABLE]
∎
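The superadditivity used at the start of this proof holds pathwise, not only in expectation: concatenating a common subsequence of the first blocks with a common subsequence of the second blocks gives a common subsequence of the full words. The sketch below checks this on an example, with illustrative names and an arbitrary cut point `cut`.

```python
import random


def lcs_length(x, y):
    """Length of a longest common subsequence of x and y (row-by-row DP)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if a == b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def split_sum(x, y, cut):
    """Sum of the LCS lengths of the two blocks obtained by cutting both words at cut."""
    return lcs_length(x[:cut], y[:cut]) + lcs_length(x[cut:], y[cut:])


def check_superadditivity(n, m, cut, trials, seed=0):
    """Return True if LCS(x, y) >= split_sum(x, y, cut) in every random trial."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.randrange(m) for _ in range(n)]
        y = [rng.randrange(m) for _ in range(n)]
        if lcs_length(x, y) < split_sum(x, y, cut):
            return False
    return True
```

Taking expectations in the pathwise inequality gives the superadditivity of the expected LCS length, which is what makes the limit of the normalized expectation exist.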
We now finish our argument showing that, with high probability, any contains a non-vanishing proportion of unmatched letters. To do so, let
[TABLE]
be the event that any pair of matching subsequences has a proportion at least of unmatched letters, and let
[TABLE]
Above, is the number of unmatched letters, since is the position of the last matched letter, while is the number of matched letters.
Lemma 7.
Let be small enough such that , as given in (3.7), satisfies
[TABLE]
where is as in (3.10). Then, for all ,
[TABLE]
and thus
[TABLE]
Proof.
Let . In order to prove (3.15), we show that if does not hold while does hold, then does not hold either. Let . If does not hold, then the proportion of unmatched letters of is smaller than , i.e.,
[TABLE]
where . (Note that , since is of maximal length.) Therefore,
[TABLE]
Now, when holds, then
[TABLE]
Comparing (3.17) with (3.18) and noting that the (random) map is increasing yields
[TABLE]
and thus
[TABLE]
Hence, from (3.14),
[TABLE]
which implies that cannot hold. ∎
As an example, when ,
[TABLE]
and therefore,
[TABLE]
In order to estimate the event , we need to show that the unmatched letters of do not concentrate in a small number of matches of . From the minimality of , the unmatched letters of a match of contain at most one compartment.
Let be the total number of letters in the sequence contained in a compartment of length at least , and let,
[TABLE]
where again is given via (3.10).
Lemma 8.
For any , there exist a positive integer and positive constants and depending on , such that
[TABLE]
Proof.
Let be the number of integers such that
[TABLE]
It is easy to check that
[TABLE]
Let now , , be equal to 1 if and only if (3.20) holds, and 0 otherwise. Clearly,
[TABLE]
To estimate the sum (3.22), decompose it into subsums of i.i.d. random variables where
[TABLE]
so that
[TABLE]
Then, from (3.21)
[TABLE]
since in (3.23) at least one of the summands has to be larger than . Now, the appearing in the subsum are i.i.d. Bernoulli random variables with
[TABLE]
Therefore,
[TABLE]
with for . Take , then . Thus it is enough to choose such that
[TABLE]
Let , , we next show that,
[TABLE]
does satisfy (3.26), or equivalently that . With the choice in (3.27), is equivalent to , which is true since
[TABLE]
Choosing and , we have
[TABLE]
∎
We can now find a suitable such that when , and all hold, then (which depends on , see (3.2)) also holds.
Lemma 9.
Let be as in Lemma 7, let be such that , and let
[TABLE]
Then, for ,
[TABLE]
and thus
[TABLE]
Proof.
We prove (3.28), from which (3.29) immediately follows. On , each has at least unmatched letters. But,
[TABLE]
When holds,
[TABLE]
Since , (3.30) and (3.31) together imply that the number of unmatched letters of is at least . By , there are at most letters contained in compartments of length at least . Thus, there are at least unmatched letters contained in compartments of length less than . But every match of contains unmatched letters from only one compartment, and as such every match can contain at most unmatched letters from compartments of length less than . Therefore, these unmatched letters, which are not in , must fill at least matches of . Hence, has at least non-empty matches. ∎
Combining Lemma 7 and Lemma 9 gives,
[TABLE]
which via (3.5), (3.11), and (3.19) entails
[TABLE]
Next, recalling the definition of in (2.3), observe that
[TABLE]
The next result estimates the first probability, on the above right hand side, and, therefore, completes the proof of Theorem 2.
Lemma 10.
Let , then
[TABLE]
Proof.
Let given as in Lemma 9 be at most 1, and let , so that . Let
[TABLE]
From (3.3), it follows that:
[TABLE]
where denotes the σ-field generated by the and , namely,
[TABLE]
Moreover, is equal to zero or one (since is non-decreasing on ) and is also -measurable. Let
[TABLE]
Note that when holds, then
[TABLE]
for all . Define
[TABLE]
and
[TABLE]
When holds, then has a slope of one on the domain . Therefore, since , the slope condition of holds on the domain . When holds, then and are equal. Therefore, when and both hold, then the slope condition of is verified on the domain . Hence,
[TABLE]
and thus
[TABLE]
It only remains to estimate . First,
[TABLE]
Then, from Hoeffding’s exponential inequality, for any ,
[TABLE]
With the help of (3.32), and since , by choosing , (3.36) becomes
[TABLE]
for all . Then, note that there are at most terms in the sum in (3.35). Thus (3.35) and (3.37) together imply that
[TABLE]
∎
4 Estimation of the Constants
To estimate in (1.3), we need to first estimate various constants.
First let . Next, to estimate , the right hand side of (2.9) needs to be lower bounded. When , (2.14) gives that
[TABLE]
Therefore, any satisfying is fine. Choosing , then
[TABLE]
Estimating and in (2.4) requires upper bounds on , , and lower bounds for , , . As shown after Lemma 7, we can choose , then
[TABLE]
and
[TABLE]
Lemma 6 gives
[TABLE]
and
[TABLE]
Lemma 8 gives
[TABLE]
and
[TABLE]
Therefore, one can take and . Then, for ,
Note that when , we also have . Let
[TABLE]
and let
[TABLE]
then one can choose in (1.3).
5 Concluding Remarks
- The results of the paper show that we can approach the uniform case as closely as we want and still have a linear order on the variance of . However, a lower bound of linear order on the variance in the uniform case is still unknown, although numerical results, see [14], leave little doubt that the variance is linear in the length of the words. (Unfortunately, the estimates of the previous section, on in (1.3), converge to zero as .)
- Combining the above results with techniques and results presented in [9], the upper and lower bounds obtained above can be generalized to provide estimates of order , , on the centered -th moment of .
- Finally, the above results might also be extended to the general case where the letters of one sequence are taken with probability , , where and , while for the other sequence the first letters are taken with probability and the extra letter is taken with probability . Then many of the lemmas remain true replacing by or . For example, in the headings of Section 3.1 and Section 3.2, in (3.1), (3.3), (3.8), and Lemma 10, the can be replaced by . In (3.4) of Lemma 4, and in the definition of in Lemma 5, the term would have to be replaced with
  [TABLE]
  However, some constants, such as , need delicate estimates, which could be a topic for further research.
References
- [1] Saba Amsalu, Christian Houdré, and Heinrich Matzinger. Sparse Long Blocks and the Variance of the Longest Common Subsequences in Random Words. arXiv:1204.1009v2 [math-ph], September 2016.
- [2] Federico Bonetto and Heinrich Matzinger. Fluctuations of the Longest Common Subsequence in the Asymmetric Case of 2- and 3-Letter Alphabets. Latin American Journal of Probability and Mathematical Statistics, 2:195–216, 2006.
- [3] Václav Chvátal and David Sankoff. Longest Common Subsequences of Two Random Sequences. Journal of Applied Probability, 12(2):306–315, 1975.
- [4] Václav Chvátal and David Sankoff. An Upper-bound Technique for Lengths of Common Subsequences. In David Sankoff and Joseph Kruskal, editors, Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, Massachusetts, 1983.
- [5] Vladimír Dančík. Expected Length of Longest Common Subsequences. PhD thesis, 1994.
- [6] Joseph G. Deken. Some Limit Results for Longest Common Subsequences. Discrete Mathematics, 26(1):17–31, January 1979.
- [7] Joseph G. Deken. Probabilistic Behavior of Longest-Common-Subsequence Length. In David Sankoff and Joseph Kruskal, editors, Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, Massachusetts, 1983.
- [8] Ruoting Gong, Christian Houdré, and Jüri Lember. Lower Bounds on the Generalized Central Moments of the Optimal Alignments Score of Random Sequences. Journal of Theoretical Probability, pages 1–41, December 2016.
