On the longest common subsequence of Thue-Morse words
Joakim Blikstad

University of Waterloo, Canada
Abstract
The length $a(n)$ of the longest common subsequence of the $n$'th Thue-Morse word and its bitwise complement is studied. An open problem suggested by Jean Berstel in 2006 is to find a formula for $a(n)$. In this paper we prove new lower bounds on $a(n)$ by explicitly constructing a common subsequence between the Thue-Morse words and their bitwise complement. We obtain the lower bound $a(n) = 2^n(1-o(1))$, saying that when $n$ grows large, the fraction of omitted symbols in the longest common subsequence of the $n$'th Thue-Morse word and its bitwise complement goes to $0$. We further generalize to any prefix of the Thue-Morse sequence, where we prove similar lower bounds.
Keywords: Thue-Morse sequence, Longest common subsequence, Combinatorial problems
Journal: Information Processing Letters
1 Introduction
The Thue-Morse sequence is a well-known sequence in mathematics and computer science, with many interesting properties.
The Thue-Morse sequence has a lot of self-symmetry, but is at the same time cube-free and overlap-free (for a more in-depth introduction to the Thue-Morse sequence, see, for instance, Allouche and Shallit [1]).
In 2006, Jean Berstel [2] formulated the problem of finding the length $a(n)$ of the longest common subsequence between the $n$'th Thue-Morse word and its bitwise complement. By bitwise complement we mean replacing $0$ with $1$ and $1$ with $0$. This paper primarily studies $a(n)$ (sequence A297618 in the Online Encyclopedia of Integer Sequences [3]). Since the Thue-Morse words are the prefixes of the Thue-Morse sequence whose length is $2^n$ for some $n$, a natural generalization is to consider prefixes of other lengths. This paper also studies $b(n)$, the length of the longest common subsequence between the length-$n$ prefix of the Thue-Morse sequence and its bitwise complement (sequence A320847).
Example 1.1.
The first few values of $a(n)$ and $b(n)$ are:
[TABLE]
To show a lower bound for , it suffices to construct a common subsequence of the Thue-Morse words and their bitwise complements. This is what is done in this paper, using the symmetries of the sequence. In particular, we provide a recursive construction for such a common subsequence, which has length at least .
This new lower bound is interesting as it means that $a(n)/2^n$ goes to $1$; that is, when $n$ grows large, the longest common subsequence omits only a vanishingly small fraction of the symbols.
2 Setup
There are many equivalent definitions of the Thue-Morse sequence and Thue-Morse words. We will define them using morphisms.
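Definition 2.1.
A morphism over an alphabet $\Sigma$ is a function $h : \Sigma^* \to \Sigma^*$ that satisfies $h(xy) = h(x)h(y)$ (concatenation) for all $x, y \in \Sigma^*$. Note that this means $h$ is uniquely defined by its behaviour on $\Sigma$.
Definition 2.2.
Let $\mu$ denote the morphism on $\{0,1\}^*$ defined by $\mu(0) = 01$ and $\mu(1) = 10$.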
There are some basic properties that follow directly from the definition.
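Proposition 2.1.
If $x, y \in \{0,1\}^*$ and $n \ge 0$ then
1. $\mu^n(\overline{x}) = \overline{\mu^n(x)}$, where $\overline{x}$ denotes taking the bitwise complement of $x$ (i.e., swapping 0s and 1s).
2. $\mu^n(xy) = \mu^n(x)\mu^n(y)$.
3. $|\mu^n(x)| = 2^n |x|$.
4. $\mu^{n+1}(0) = \mu^n(0)\mu^n(1)$ and $\mu^{n+1}(1) = \mu^n(1)\mu^n(0)$.
Proof.
(i) follows from the symmetry (between $0$ and $1$) in the definition of $\mu$. (ii) holds for all morphisms. (iii) follows from an induction argument since $|\mu(x)| = 2|x|$ for every binary string $x$. (iv) can be seen from $\mu^{n+1}(0) = \mu^n(\mu(0)) = \mu^n(01) = \mu^n(0)\mu^n(1)$, and symmetrically for $\mu^{n+1}(1)$. ∎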
Definition 2.3.
We call $\mu^n(0)$ the $n$'th Thue-Morse word. We also say the Thue-Morse sequence, denoted by $t$, is the unique fixed point of $\mu$ (extended to the domain of infinite binary strings) beginning with a $0$. See Allouche and Shallit [1] for why such a fixed point exists and is unique.
Definition 2.4.
Denote by $a(n)$ the length of the longest common subsequence of $\mu^n(0)$ and $\mu^n(1)$. Similarly, denote by $b(n)$ the length of the longest common subsequence of the prefix of length $n$ of the Thue-Morse sequence and its bitwise complement.
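For small inputs, the two quantities in Definition 2.4 can be computed by brute force. The following is a minimal sketch using the textbook quadratic dynamic program for the longest common subsequence; the function names (`tm_prefix`, `lcs_length`, `a`, `b`) are illustrative choices and not notation from the paper, and the approach is only practical for short prefixes.

```python
def complement(w: str) -> str:
    """Swap 0s and 1s in a binary string."""
    return w.translate(str.maketrans("01", "10"))


def tm_prefix(length: int) -> str:
    """Prefix of the Thue-Morse sequence, built by doubling w -> w + complement(w)."""
    w = "0"
    while len(w) < length:
        w += complement(w)
    return w[:length]


def lcs_length(x: str, y: str) -> int:
    """Length of a longest common subsequence (standard O(|x||y|) dynamic program)."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def b(n: int) -> int:
    """LCS length of the length-n prefix and its bitwise complement."""
    p = tm_prefix(n)
    return lcs_length(p, complement(p))


def a(n: int) -> int:
    """LCS length of the n'th Thue-Morse word and its complement."""
    return b(2 ** n)


print([a(n) for n in range(4)])  # [0, 1, 2, 5]
```

The dynamic program keeps only two rows, so memory stays linear in the prefix length, but the running time is quadratic and grows quickly with $n$.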
Example 2.1.
The first few Thue-Morse words are
$\mu^0(0) = 0$
$\mu^1(0) = 01$
$\mu^2(0) = 0110$
$\mu^3(0) = 01101001$
$\mu^4(0) = 0110100110010110$
The Thue-Morse sequence starts as follows
$t = 01101001100101101001011001101001\ldots$
Remark.
The Thue-Morse words are sometimes defined by the recurrence relation in Proposition 2.1 part (iv), and the Thue-Morse sequence as the limit of the infinite application of this rule. We see that the $n$'th Thue-Morse word is the prefix of length $2^n$ of the Thue-Morse sequence. This also means that $a(n) = b(2^n)$.
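The morphism-based definition and the recurrence from the remark can be compared directly. Here is a small sketch, assuming the usual morphism $0 \to 01$, $1 \to 10$; the helper names `mu`, `complement` and `thue_morse_word` are hypothetical and chosen only for this illustration.

```python
def mu(w: str) -> str:
    """Apply the morphism 0 -> 01, 1 -> 10 to every symbol of w."""
    return "".join("01" if ch == "0" else "10" for ch in w)


def complement(w: str) -> str:
    """Swap 0s and 1s in a binary string."""
    return w.translate(str.maketrans("01", "10"))


def thue_morse_word(n: int, start: str = "0") -> str:
    """Apply mu to `start` n times."""
    w = start
    for _ in range(n):
        w = mu(w)
    return w


words = [thue_morse_word(n) for n in range(7)]
for n in range(6):
    w = words[n]
    assert len(w) == 2 ** n                            # length doubles each step
    assert thue_morse_word(n, "1") == complement(w)    # complement symmetry
    assert words[n + 1] == w + complement(w)           # recurrence from the remark
    assert words[n + 1].startswith(w)                  # each word is a prefix of the next

print(words[:4])  # ['0', '01', '0110', '01101001']
```

Because every word is a prefix of the next one, the words converge to a single infinite sequence, which is the fixed point described in the definition.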
We also need the following proposition, for which the proof can be found in [1].
Proposition 2.2.
If $t_0 t_1 t_2 \cdots$ are the symbols of the Thue-Morse sequence we have $t_{2n} = t_n$ and $t_{2n+1} = \overline{t_n}$ for all $n \ge 0$. Moreover, $t_n$ equals the parity of the number of "1" bits in the binary representation of $n$.
Corollary 2.3.
The $2i$'th digit of $t$ is the same as the $(2i+1)$'th digit of $\overline{t}$ (where we use zero-indexing).
Proof.
The $2i$'th digit of $t$ is $t_i$, and the $(2i+1)$'th digit of $\overline{t}$ is $\overline{\overline{t_i}} = t_i$, by the above proposition. ∎
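The digit identities above are easy to check numerically. The sketch below takes the bit-parity characterization as the definition of the digits (the function name `t` mirrors the sequence symbol but is otherwise an arbitrary choice) and tests the two recurrences and the shifted-complement identity of the corollary on an initial range.

```python
def t(i: int) -> int:
    """i'th Thue-Morse digit: parity of the number of 1 bits of i."""
    return bin(i).count("1") % 2


LIMIT = 1 << 12  # arbitrary range for the check

for i in range(LIMIT):
    assert t(2 * i) == t(i)              # even positions copy the digit
    assert t(2 * i + 1) == 1 - t(i)      # odd positions complement it
    # Corollary: digit 2i of t equals digit 2i+1 of the complemented sequence.
    assert t(2 * i) == 1 - t(2 * i + 1)

print("".join(str(t(i)) for i in range(16)))  # 0110100110010110
```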
3 Construction of a common subsequence
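We are now ready for a construction of a common subsequence between $\mu^n(0)$ and $\mu^n(1)$ when $n$ is a power of $2$. We call this common subsequence $c(n)$, and define it recursively.
1. When $n = 1$, $\mu(0) = 01$ and $\mu(1) = 10$, and we define $c(1) = 0$, a subsequence of $01$ and $10$.
2. For $n = 2^k$ with $k \ge 1$, $c(n)$ will be defined recursively as follows.
Let $m = n/2$ and $s = 2^m$. Say $x = \mu^n(0)$ and $y = \mu^n(1)$, that is, we are constructing $c(n)$ as a common subsequence of $x$ and $y$. Write $x$ and $y$ as concatenations of blocks of size $s$ (this is possible since $s$ divides $2^n$), say
$x = x_0 x_1 \cdots x_{s-1}$ and $y = y_0 y_1 \cdots y_{s-1}$.
Since $x = \mu^m(\mu^m(0))$, each $x_i$ is one of $\mu^m(0)$ or $\mu^m(1)$. Similarly each $y_i$ is one of $\mu^m(0)$ or $\mu^m(1)$. It is also worth noting that $x_i = \mu^m(0)$ if the $i$'th digit of $\mu^m(0)$ is $0$, and similarly $y_i = \mu^m(0)$ if the $i$'th digit of $\mu^m(1)$ is $0$.
Now we compare $x_i$ to $y_{i+1}$ for $0 \le i < s - 1$, and find a common subsequence $c_i$ between them.
(a) When $i$ is even, $x_i = y_{i+1}$ by Corollary 2.3, so we take $c_i = x_i$.
(b) When $i$ is odd, either $x_i$ and $y_{i+1}$ are the same, or one is $\mu^m(0)$ and the other is $\mu^m(1)$. If they are the same we take $c_i = x_i$, otherwise $c_i = c(m)$.
We then let $c(n)$ be the concatenation of the $c_i$'s.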
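The recursive construction can be sketched in a few lines. This is one reading of the steps above, under two assumptions that are labelled in the code: the base case is taken to be the single symbol `"0"`, and block $i$ of the first word is matched against block $i+1$ of the second, with the last block of the first word dropped. The names `tm_word`, `build_common`, and `is_subsequence` are hypothetical.

```python
def tm_word(n: int, bit: int = 0) -> str:
    """n-fold image of `bit` under the morphism 0 -> 01, 1 -> 10."""
    w = str(bit)
    for _ in range(n):
        w = "".join("01" if ch == "0" else "10" for ch in w)
    return w


def is_subsequence(s: str, w: str) -> bool:
    """True if s can be obtained from w by deleting symbols."""
    it = iter(w)
    return all(ch in it for ch in s)


def build_common(n: int) -> str:
    """Common subsequence of tm_word(n, 0) and tm_word(n, 1); n must be a power of two."""
    if n == 1:
        return "0"  # assumed base case: a common symbol of 01 and 10
    half = n // 2
    size = 2 ** half  # block length; there are also `size` blocks
    x, y = tm_word(n, 0), tm_word(n, 1)
    x_blocks = [x[i * size:(i + 1) * size] for i in range(size)]
    y_blocks = [y[i * size:(i + 1) * size] for i in range(size)]
    inner = build_common(half)  # used whenever the two compared blocks differ
    parts = []
    for i in range(size - 1):   # block i of x against block i + 1 of y
        parts.append(x_blocks[i] if x_blocks[i] == y_blocks[i + 1] else inner)
    return "".join(parts)


for n in (1, 2, 4):
    s = build_common(n)
    assert is_subsequence(s, tm_word(n, 0)) and is_subsequence(s, tm_word(n, 1))

print([len(build_common(n)) for n in (1, 2, 4)])  # [1, 2, 10]
```

The sketch rebuilds the full words at every level for clarity; since only the digits of the shorter word are needed to decide which blocks agree, a more economical version would avoid materializing them.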
Example 3.1.
The common subsequences $c(1)$, $c(2)$ and $c(4)$ are underlined below:
[TABLE]
Remark.
$c(n)$ is not necessarily the longest common subsequence. For example
[TABLE]
is the longest common subsequence between and , which has length , while .
4 Analysis of length
In this section we analyse the length of the common subsequence constructed in the previous section.
Definition 4.1.
For an integer $n = 2^k$, let $d(n) = 2^n - |c(n)|$ be the number of symbols omitted by the common subsequence $c(n)$.
Remark.
$d(1) = 1$, as $|c(1)| = 1$.
When constructing $c(n)$, all the even indexed blocks (of size $2^{n/2}$) in $x$ are chosen to be in $c(n)$. So only the odd indexed blocks can contribute to $d(n)$. The last block will be completely omitted, and for the other blocks in odd positions we either miss $d(n/2)$ symbols if matching with $c(n/2)$ recursively, or miss nothing if choosing to include the complete block. This leads us to the following lemma.
Lemma 4.4.
For every integer $n = 2^k$ with $k \ge 1$,
$$d(n) \le 2^{n/2} + \left(2^{n/2-1} - 1\right) d(n/2).$$
Proof.
The last block has size $2^{n/2}$, and there are $2^{n/2-1} - 1$ other odd indexed blocks, and in each we miss at most $d(n/2)$ symbols. So the lemma follows from the above discussion. ∎
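To get a feeling for how fast the omitted fraction shrinks, the counting argument can be iterated numerically. The sketch below assumes the recurrence reads "one full block of size $2^{n/2}$, plus at most one recursive loss for each of the other $2^{n/2-1} - 1$ odd indexed blocks", starting from one omitted symbol at $n = 1$; the function name `omitted_bound` is hypothetical, and the numbers are upper bounds from this reading, not exact values.

```python
def omitted_bound(n: int) -> int:
    """Upper bound on the number of omitted symbols, for n a power of two."""
    if n == 1:
        return 1  # one of the two symbols of 01 is dropped
    half = n // 2
    last_block = 2 ** half                  # the final block is dropped entirely
    other_odd_blocks = 2 ** (half - 1) - 1  # each loses at most omitted_bound(half)
    return last_block + other_odd_blocks * omitted_bound(half)


for n in (1, 2, 4, 8, 16, 32):
    bound = omitted_bound(n)
    print(n, bound, bound / 2 ** n)  # exponent, bound, bound as a fraction of the word length
```

Under this reading, the bound as a fraction of the word length roughly halves each time $n$ doubles, which matches the claim that the omitted fraction tends to zero.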
We are now ready to prove an upper bound on $d(n)$.
Lemma 4.5.
For every integer , .
Proof.
We proceed by induction on .
The inequality clearly holds for since
Now suppose the inductive assertion holds for , that is . Using Lemma 4.4 and the induction hypothesis we have
[TABLE]
Note that for all integers , since for all integers . Thus
[TABLE]
This concludes the induction proof. ∎
By Lemma 4.4 it follows that for all . This means that the length of our constructed common subsequence of and where must be at least . This proves the following theorem.
Theorem 4.6.
For and :
[TABLE]
5 Extension to all $n$
Up to this point we have only considered the common subsequence of $\mu^n(0)$ and $\mu^n(1)$ where $n = 2^k$ for some $k$. We wish to extend our construction to work for arbitrary $n$.
If and , then say for some integer . Write
[TABLE]
This is saying that () can be written as blocks, where each block is either or . We can concatenate copies of the subsequence to obtain a common subsequence of and , i.e., we use our previous construction for each of the blocks independently. Using Theorem 4.6 we see that the length of this common subsequence is at least , since by choice of . We thus get a similar result as Theorem 4.6 for arbitrary .
Theorem 5.7.
For every , there exists a common subsequence between and with length at least
[TABLE]
Corollary 5.8.
, or more generally .
We can generalize the result further to all prefixes of the Thue-Morse sequence. Let $p$ be the prefix of length $n$ of the Thue-Morse sequence, and $\overline{p}$ its bitwise complement. Based on the binary representation of the number $n$, $p$ and $\overline{p}$ can be split up into at most $\lfloor \log_2 n \rfloor + 1$ blocks, each with a size which is a power of $2$. We will assume the blocks are in order of decreasing size, so that a block of size $2^j$ is either $\mu^j(0)$ or $\mu^j(1)$. Then common subsequences satisfying the inequality in Theorem 5.7 for these blocks can be concatenated to form a common subsequence between $p$ and $\overline{p}$. To bound the length of this common subsequence we use the following lemma:
Lemma 5.9.
* for all .*
Proof.
We prove the inequality by induction on .
For we have , and for we have .
Now suppose and . This means that
[TABLE]
which concludes the induction proof. ∎
Now we continue to analyse the common subsequence between and . This subsequence omits at most symbols for the block of size (by Theorem 5.7). There is at most one block of size for each . The potential block of size will miss at most one symbol. Hence at most
[TABLE]
symbols are omitted, which by Lemma 5.9 is at most
[TABLE]
This proves the following theorem.
Theorem 5.10.
For all , there exists a common subsequence between and with length at least
[TABLE]
Corollary 5.11.
, or more generally .
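The block decomposition of a prefix used in this section can be illustrated as follows. This is a sketch under the assumption that the blocks are read off the binary representation of the prefix length from the most significant bit down; `split_prefix`, `tm_prefix` and `tm_word` are hypothetical helper names.

```python
def tm_prefix(length: int) -> str:
    """Prefix of the Thue-Morse sequence via the bit-parity characterization."""
    return "".join(str(bin(i).count("1") % 2) for i in range(length))


def tm_word(j: int, bit: int) -> str:
    """j-fold image of `bit` under the morphism 0 -> 01, 1 -> 10."""
    w = str(bit)
    for _ in range(j):
        w = "".join("01" if ch == "0" else "10" for ch in w)
    return w


def split_prefix(length: int) -> list[str]:
    """Split the prefix into blocks of decreasing power-of-two sizes."""
    prefix = tm_prefix(length)
    blocks, pos = [], 0
    for j in reversed(range(length.bit_length())):
        if (length >> j) & 1:               # bit j set: take a block of size 2^j
            blocks.append(prefix[pos:pos + 2 ** j])
            pos += 2 ** j
    return blocks


for length in range(1, 300):
    blocks = split_prefix(length)
    assert "".join(blocks) == tm_prefix(length)
    for blk in blocks:
        j = len(blk).bit_length() - 1       # block size is 2^j
        assert blk in (tm_word(j, 0), tm_word(j, 1))

print(split_prefix(13))  # ['01101001', '1001', '0']
```

Each block starts at a position that is a multiple of its own size, which is why it is again a Thue-Morse word or the complement of one.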
6 Strengthening the analysis
The constructed common subsequence $c(n)$, and the generalizations in the previous section, do in fact have slightly better asymptotic behaviour than what was proven in Section 4.
The previous length analysis was based on Lemma 4.4, which states that $d(n) \le 2^{n/2} + (2^{n/2-1} - 1)\, d(n/2)$. This inequality is only tight when $x_i \ne y_{i+1}$ for all odd $i$, using the same notation as in Section 3. However, we can get a better bound on $d(n)$ in terms of $d(n/2)$ by estimating how many of the blocks $x_i$ and $y_{i+1}$ are equal for odd $i$.
Lemma 6.12.
If $t_0 t_1 t_2 \cdots$ are the digits of the Thue-Morse sequence, then $t_i = t_{i+1}$ if and only if $i$ written in binary ends with a block of $1$'s with odd length.
Proof.
We use Proposition 2.2. $t_i = t_{i+1}$ if and only if $i$ and $i+1$ have the same number of "1" bits modulo 2, when written in binary. This condition is equivalent to $i$ ending with a block of $1$'s of odd length when written in binary. ∎
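The equivalence in the lemma can be checked exhaustively on an initial range. In this sketch `trailing_ones` is a hypothetical helper that returns the length of the final block of 1 bits (zero when the number is even), and the digits are computed from the bit-parity characterization.

```python
def t(i: int) -> int:
    """i'th Thue-Morse digit: parity of the number of 1 bits of i."""
    return bin(i).count("1") % 2


def trailing_ones(i: int) -> int:
    """Length of the block of 1 bits at the end of the binary expansion of i."""
    count = 0
    while i & 1:
        count += 1
        i >>= 1
    return count


for i in range(1 << 14):
    same_neighbour = t(i) == t(i + 1)
    assert same_neighbour == (trailing_ones(i) % 2 == 1)

# Adding 1 turns a final block of r ones into zeros and sets one new bit,
# so the bit count changes by 1 - r, which is even exactly when r is odd.
```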
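Lemma 6.13.
Let $k \ge 1$. Then
$$\#\{\, 0 \le i < 2^k - 1 : t_i = t_{i+1} \,\} = \begin{cases} (2^k - 1)/3 & \text{if } k \text{ is even,} \\ (2^k - 2)/3 & \text{if } k \text{ is odd.} \end{cases}$$
Proof.
By Lemma 6.12, for a fixed $k$ we count how many $k$-bit numbers (except $2^k - 1$) end with a block of $1$'s of odd length. We can fix the $k$-bit number to end with a "$0$" followed by $j$ "$1$"s, for the different odd values of $j$, and then have $2^{k-1-j}$ possibilities for the leading digits. This works as we do not wish to count $2^k - 1$, which is the unique $k$-bit binary number with all "1"s.
1. If $k$ is even, then $j \in \{1, 3, \ldots, k-1\}$ and the count is $2^{k-2} + 2^{k-4} + \cdots + 2^0 = (2^k - 1)/3$.
2. If $k$ is odd, then $j \in \{1, 3, \ldots, k-2\}$ and the count is $2^{k-2} + 2^{k-4} + \cdots + 2^1 = (2^k - 2)/3$. ∎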
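The count in this proof can be compared against brute force. The closed forms used below, $(2^k-1)/3$ for even $k$ and $(2^k-2)/3$ for odd $k$ (both equal to $\lfloor 2^k/3 \rfloor$), are our evaluation of the two geometric sums and should be read as an assumption rather than as the paper's stated formula; `count_equal_neighbours` is a hypothetical name.

```python
def t(i: int) -> int:
    """i'th Thue-Morse digit: parity of the number of 1 bits of i."""
    return bin(i).count("1") % 2


def count_equal_neighbours(k: int) -> int:
    """Number of i with 0 <= i < 2^k - 1 and t(i) == t(i + 1)."""
    return sum(1 for i in range(2 ** k - 1) if t(i) == t(i + 1))


for k in range(1, 15):
    brute = count_equal_neighbours(k)
    closed = (2 ** k - 1) // 3 if k % 2 == 0 else (2 ** k - 2) // 3
    assert brute == closed == 2 ** k // 3

print([count_equal_neighbours(k) for k in range(1, 7)])  # [0, 1, 2, 5, 10, 21]
```

So roughly one third of the adjacent pairs of digits agree, and all of them sit at odd positions.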
By Proposition 2.2 we see that
[TABLE]
By Lemma 6.13 we thus know that when constructing , exactly of the odd indexed blocks will already be equal. Hence exactly of the pairs will need to be recursively matched using . This leads to the following improved version of Lemma 4.4:
Lemma 6.14.
For every integer ,
[TABLE]
Remark.
From the above lemma, we can solve for exactly. The first few values for are:
[TABLE]
Corollary 6.15.
Let . For every integer , .
Proof.
If , we have by the lemma
[TABLE]
∎
By a similar induction proof as in Lemma 4.5 we get a new upper bound on .
Theorem 6.16.
Let . For every integer , .
Proof.
We proceed by induction on .
It is easy to verify that the inequality holds for .
Now suppose the inductive assertion holds for , that is . Using Corollary 6.15 and the induction hypothesis we have
[TABLE]
since when . This concludes the induction proof. ∎
This means that the length of the common subsequence is
[TABLE]
This asymptotic behaviour propagates through the other generalizations, and we obtain slightly better versions of Corollaries 5.8 and 5.11.
Theorem 6.17.
* and where .*
7 Acknowledgment
I thank Jeffrey Shallit for telling me about the problem.
References
- [1] J.-P. Allouche, J. Shallit, The ubiquitous Prouhet-Thue-Morse sequence, in: Sequences and Their Applications: Proceedings of SETA '98, Springer-Verlag, 1999, pp. 1-16.
- [2] J. Berstel, Combinatorics on Words: Examples and Problems, http://www-igm.univ-mlv.fr/~berstel/Exposes/2006-05-24Turku Cow.pdf (2006).
- [3] N. J. A. Sloane, Online Encyclopedia of Integer Sequences, http://oeis.org.
