Burrows-Wheeler transformations and de Bruijn words
Peter M. Higgins

TL;DR
This paper explores the extended Burrows-Wheeler transform, linking it to permutations and semigroups, and applies it to generate de Bruijn words by inverting the transform.
Contribution
It provides a new perspective on the extended Burrows-Wheeler transform using permutation theory and demonstrates its application in generating de Bruijn words.
Findings
Established a link between the extended BWT and cyclic semigroups.
Provided a method to generate de Bruijn words via inverting the transform.
Linked syntactic semigroups to the extended BWT.
Abstract
We formulate and explain the extended Burrows-Wheeler transform of Mantaci et al from the viewpoint of permutations on a chain taken as a union of partial order-preserving mappings. In so doing we establish a link with syntactic semigroups of languages that are themselves cyclic semigroups. We apply the extended transform with a view to generating de Bruijn words through inverting the transform.
Peter M. Higgins, University of Essex, U.K
Abstract
We formulate and explain the extended Burrows-Wheeler transform of Mantaci et al. from the viewpoint of permutations on a chain taken as a union of partial order-preserving mappings. In so doing we establish a link with syntactic semigroups of languages that are themselves cyclic semigroups. We apply the extended transform with a view to generating de Bruijn words through inverting the transform. We also make use of de Bruijn words to facilitate a proof that the maximum number of distinct factors of a word of length $n$ has the form $\frac{1}{2}n^{2}-O(n\log n)$.
1 Introduction
1.1 Definitions and Example
The original notion of a Burrows-Wheeler (BW) transform, introduced in [2], has become a major tool in lossless data compression. It replaces a primitive word (one that is not a power of some other word) by another word of the same length over the same alphabet but in a way that is generally rich in letter repetition and so lends itself to easy compression. Moreover the transform can be inverted in linear time; see for example [3]. Unfortunately, not all words arise as Burrows-Wheeler transforms of a primitive word so, in the original format, it was not possible to invert an arbitrary string. The extended BW transform, however, does allow the inversion of an arbitrary word, and the result in general is a multiset (a set allowing repeats) of necklaces, which are conjugacy classes of primitive words. This was first explicitly introduced in [8] by Mantaci et al., based on the bijection between these two collections first enunciated by Gessel and Reutenauer in [5].
In this opening section we will explain and prove the existence of the extended transform in a fashion that emphasises the approach whereby a permutation on a finite chain is expressed as a disjoint union of one-to-one partial order-preserving mappings.
**Notation and Background** The underlying base set for our mappings will be the finite chain . As usual will stand for the free monoid over , which is simply the set of all words, or strings, over the alphabet together with the empty word , although throughout this paper we assume a fixed order for . The free semigroup is denoted by . For emphasis, we sometimes denote equality of by . The set of letters that occur at least once in is known as the content of , denoted by . Following [8] we shall denote the first and last letters of a word respectively by and . In general, the th letter of a word is written as . The number of instances of the letter in a word will be denoted by , while the length of is written . We say that is primitive if is not a power of some other word. A word is a *factor* of if ; is an -factor of if additionally . We call a *prefix (respectively suffix)* of if (respectively ). A subword of is any word that may be formed by deletion of some of the letters of ; it follows that the factors of represent a special class of subwords of .
A standard text for results concerning combinatorics on words is [7], in which may be found proofs of the simple unproved assertions concerning roots and conjugates that follow. If we say that is a conjugate of . The relation on whereby if is a conjugate of is an equivalence relation on . In the case of a primitive word , the equivalence classes of are known as *necklaces*, and we denote the necklace of a word by ; the *length* of is , which is also the cardinal of the necklace as is primitive. The first word of in the lexicographic order is known as its Lyndon word. A border of a word is a word such that . No Lyndon word has a border (see Proposition 2.2(iii)).
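The notions of conjugate, necklace, Lyndon word and border can be checked by brute force on small words. The following is a minimal sketch; the function names are hypothetical and words are represented as Python strings compared in the usual character order, which is an assumption standing in for the fixed order on the alphabet.

```python
from itertools import product

def conjugates(w):
    """All rotations of w, listed by rotation offset (repeats possible)."""
    return [w[i:] + w[:i] for i in range(len(w))]

def is_primitive(w):
    """A non-empty word is primitive iff its rotations are pairwise distinct."""
    return len(w) > 0 and len(set(conjugates(w))) == len(w)

def lyndon_word(w):
    """Least conjugate of w in lexicographic order."""
    return min(conjugates(w))

def is_lyndon(w):
    """w is Lyndon iff it is strictly less than each of its proper rotations."""
    return len(w) > 0 and all(w < c for c in conjugates(w)[1:])

def borders(w):
    """Non-empty proper prefixes of w that are also suffixes of w."""
    return [w[:k] for k in range(1, len(w)) if w[:k] == w[-k:]]

def lyndon_words(alphabet, n):
    """Brute-force list of the Lyndon words of length n (small n only)."""
    return [w for w in ("".join(t) for t in product(alphabet, repeat=n))
            if is_lyndon(w)]
```

For instance, `lyndon_word("bab")` picks out `"abb"` from the necklace of `bab`, and `borders` returns an empty list on every Lyndon word it is given, in line with the statement that no Lyndon word has a border.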
The root of a word is the shortest factor root of such that for some . Two words and commute in if and only if they share a common root, which is in turn equivalent to the condition that and have a common power. The number of distinct conjugates of a word equals the length of root and root, the root of a conjugate of , is a conjugate of root.
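The statements about roots can be tested directly on short words. This sketch uses hypothetical helper names and a brute-force search for the root; it is meant only as an illustration of the facts just quoted, not as an efficient algorithm.

```python
from itertools import product

def root(w):
    """Shortest prefix r of w such that w is a power of r."""
    n = len(w)
    for d in range(1, n + 1):
        if n % d == 0 and w[:d] * (n // d) == w:
            return w[:d]
    return w  # only reached for the empty word

def distinct_conjugates(w):
    """Set of distinct rotations of w."""
    return {w[i:] + w[:i] for i in range(len(w))}

def commute(u, v):
    """Two words commute iff uv = vu."""
    return u + v == v + u

def words(alphabet, max_len):
    """All non-empty words over alphabet of length at most max_len."""
    return ["".join(t) for n in range(1, max_len + 1)
            for t in product(alphabet, repeat=n)]
```

With these helpers, `commute(u, v)` agrees with `root(u) == root(v)` on all pairs of short non-empty words, and the number of distinct conjugates of `w` equals `len(root(w))`.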
For a word we denote the infinite one-sided word by with the notion of factor extending in the obvious way. Note that if and only if root root. The factors of of finite length are the *power factors* of ; a power factor for which is a cyclic factor of : equivalently is a factor of some conjugate of .
The interval of a chain is the subset . A mapping , the domain and range of which are both subsets of , is order-preserving if when and are both defined, satisfies the condition:
[TABLE]
We shall frequently use the action notation, as opposed to juxtaposition when the symbol on the right is a function and not a product in (although a central dot is also used at times simply as a visual separator within a word). Mapping composition is written from left to right. Here we write to denote the (inverse) semigroup of all partial one-to-one mappings on , and we denote the (inverse) subsemigroup of all order-preserving members of by .
**Example 1.1** We give an example, following [8], that illustrates how to effect the bijection from multisets of necklaces to words and how to reverse this process. Let our alphabet be and let the set of Lyndon words of our necklaces be . Consider the collection of all words of the form , where and is the least common multiple of the lengths of the words of : in this instance . All these words then have common length . We order this set of words lexicographically to yield, in our example, the following array.
[TABLE]
The Burrows-Wheeler transform of is then the word formed by the th column of the table, read from the top, which in this case gives . The word is also formed by the list of last letters : both renditions of are highlighted in bold in the table. In [8] was defined by the letters . Their definition was also framed in the context of the infinite table of rows , which simply consists of the table of the first columns of , as defined above, repeated infinitely often. However, as explained in [8], the table does not need to be extended to columns in order to determine the order of the rows: by a theorem of Wilf and Fine on word periodicity, the order of two rows that are respective powers of the root words and matches the lexicographic order of their prefixes of length $k=|u|+|v|-\gcd(|u|,|v|)$ (and this bound is tight). Hence the number of columns required in order to determine the row order of the table is always less than the sum of the lengths of the longest two necklaces of the multiset. The formal use of the lcm here allows us to define as a specified column of the table, which is a conceptual convenience used in our proofs. The stipulation that the words of be primitive is necessary in order that the transform be one-to-one. Note that the roots of the words are not in lexicographic order: the root precedes the root in the table. However the Lyndon roots do appear in lexicographic order: both lexicographically and in the rows of the table (see Theorem 1.2.13).
We recover the set from by way of the so-called *standard permutation*. To construct , take the first column of the table, which consists of the content of the words of arranged in alphabetical order with the number of occurrences of a letter equal to the number of instances of that letter among the Lyndon words of . In our example the column of first letters forms the word . The permutation is then the union of a collection of partial one-to-one and order-preserving mappings, one for each member of . In this case ; the domain and range of is defined respectively by the positions of the instances of the letter in and respectively. Since is one-to-one and order-preserving, is defined uniquely by its domain and range, and of course is defined in the same fashion, and so on for any remaining letters in . In our example we obtain:
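The periodicity bound quoted above suggests a way to compare two rows without writing them out to the full lcm length. The sketch below compares the infinite powers of two words using only prefixes of length |u|+|v|-gcd(|u|,|v|), and includes a reference comparison on lcm-length powers for cross-checking. All function names are hypothetical.

```python
from itertools import cycle, islice, product
from math import gcd

def power_prefix(u, k):
    """Prefix of length k of the infinite word u u u ..."""
    return "".join(islice(cycle(u), k))

def compare_infinite_powers(u, v):
    """Return -1, 0 or 1 according to the order of the infinite powers of u and v,
    looking only at prefixes of length |u| + |v| - gcd(|u|, |v|)."""
    k = len(u) + len(v) - gcd(len(u), len(v))
    pu, pv = power_prefix(u, k), power_prefix(v, k)
    return (pu > pv) - (pu < pv)

def compare_via_lcm(u, v):
    """Reference comparison on the powers of common length lcm(|u|, |v|)."""
    m = len(u) * len(v) // gcd(len(u), len(v))
    pu, pv = u * (m // len(u)), v * (m // len(v))
    return (pu > pv) - (pu < pv)

def words(alphabet, max_len):
    """All non-empty words over alphabet of length at most max_len."""
    return ["".join(t) for n in range(1, max_len + 1)
            for t in product(alphabet, repeat=n)]
```

The pair `ab`, `aba` illustrates tightness: the bound is 4, the infinite powers agree on their first three letters, and they first differ at the fourth.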
[TABLE]
with dom and dom . The cardinality of is equal to the number of cycles in the disjoint cycle representation of , which here is . We may retrieve the Lyndon word of the multiset corresponding to each cycle of by simply replacing each integer in the cycle by the letter such that dom . In our case this means that we write whenever we see a number from [math] to and we write otherwise. In this way we recover .
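The forward direction of the example can be sketched in a few lines. The multiset used in the usage note below (the necklaces of `ab` and `abb`) is a hypothetical illustration and is not claimed to be the multiset of Example 1.1; the function name and the representation of each necklace by one primitive word are likewise assumptions.

```python
from functools import reduce
from math import gcd

def extended_bwt(words):
    """Extended BW transform of a multiset of necklaces.

    `words` holds one primitive representative per necklace (repeats allowed).
    Returns the transform together with the sorted table of rows.
    """
    # Common row length: lcm of the necklace lengths (1 for the empty multiset).
    m = reduce(lambda a, b: a * b // gcd(a, b), (len(u) for u in words), 1)
    rows = []
    for u in words:
        for i in range(len(u)):
            c = u[i:] + u[:i]                # a conjugate of u
            rows.append(c * (m // len(c)))   # its power of length m
    rows.sort()                              # lexicographic order of the rows
    return "".join(r[-1] for r in rows), rows
```

For `["ab", "abb"]` the rows are `ababab, abbabb, bababa, babbab, bbabba`, the first column reads `aabbb`, and the final column gives the transform `bbaba`. With a single necklace the function reduces to the classical transform, for example `nnbaaa` for `banana`.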
1.2 Establishing the transform through partial order-preserving mappings
Using Example 1.1 as a guide, we formally define the Burrows-Wheeler transform and explain its inversion.
Definition 1.2.1 (Conjugation Map) Let be the mapping whereby .
Proposition 1.2.2 The Conjugation Map has the following properties:
(i) is a permutation on ;
(ii) if is closed under conjugation then permutes .
(iii) Suppose that $S\subseteq aA^{n}$ $(a\in A,\ n\geq 0)$. Then acts in an order-preserving manner on .
(iv) For any word with root, is the least positive integer such that .
Proof (i) is clear from the definition and (ii) follows from (i) as the given condition ensures that is closed under both and . To see (iii) suppose that with . Since , it follows that whence and so is order-preserving on the set . As for (iv), if then so in particular . Suppose that so that say. Then and since but as is primitive, it follows that .
Definitions 1.2.3 (Burrows-Wheeler map) Let denote the set of all finite multisets of necklaces over . Let $BW:{\cal M}\rightarrow A^{*}$ denote the Burrows-Wheeler map, the action of which is defined as follows. Take any so that and let be the least common multiple of the lengths of the . Sort by lexicographic order the collection of powers , where is a word of the necklace . The table is then a dictionary of words of common length . The word is then the final column, read from top to bottom, of . (Conventionally, maps the empty set to the empty word.)
**Definition 1.2.4** *(Standard permutation of a word)* Let and let be the rearrangement of the letters of in lexicographic order. For each letter we define a partial one-to-one order-preserving mapping through specifying dom and ran as follows: dom is the interval of length corresponding to the positions occupied by in while ran is the set of positions occupied by in . The standard permutation of is then .
**Remark 1.2.5** For any there is a unique such that . For any , we may define . We note that and for any and there is a unique word of length such that is defined.
**Proposition 1.2.6** [9, Proposition 10] Let as in Definition 1.2.3, let the set of words that form the rows of be denoted by and let . Let be the standard permutation of . Then the mapping is the restriction of the conjugation map to .
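The standard permutation can be computed with a stable sort of the positions of the word by letter: stability keeps the positions of each letter in increasing order, which is exactly the order-preserving requirement on each partial mapping. This is a sketch with hypothetical names, and it uses 0-indexed positions where the text uses the chain starting at 1.

```python
def standard_permutation(w):
    """tau[i] is the position in w of the i-th letter of sorted(w) (0-indexed).

    Python's sort is stable, so equal letters keep their relative order and
    each per-letter restriction of tau is order-preserving.
    """
    return sorted(range(len(w)), key=lambda j: w[j])

def letter_maps(w):
    """Split the standard permutation into one partial mapping per letter."""
    tau = standard_permutation(w)
    first = sorted(w)                 # letters of w in lexicographic order
    maps = {}
    for i, j in enumerate(tau):
        maps.setdefault(first[i], {})[i] = j
    return maps
```

For the hypothetical word `bbaba`, `letter_maps` gives `{0: 2, 1: 4}` for `a` and `{2: 0, 3: 1, 4: 3}` for `b`; the domains are the intervals occupied by each letter in `aabbb` and the ranges are the positions of that letter in `bbaba`.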
Proof Suppose that and that is the th word of . Then the th instance of in the first column of occurs in row . Hence, regarded as intervals of , dom dom . Similarly, since is the final column of , ran . Therefore since and are order-preserving mappings (the latter by Proposition 1.2.2(iii)) with common domain and range, they are equal. Since this is true for all letters , we infer that in that if and only if under .
The following was observed in [3], at least for the case of the BW transform of a single necklace.
**Proposition 1.2.7** Let , let be the standard permutation and let . Then , which is to say that maps each column of to its predecessor column modulo , the number of columns of . In particular maps the first column of to the last.
Proof Let be a row of with so that say. Then by Proposition 1.2.6, . The letter will therefore be shifted one place back to appear in column and in row so that .
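Propositions 1.2.6 and 1.2.7 can be checked numerically on small multisets: build the sorted table, read off its final column, form the standard permutation of that column, and confirm that the permutation sends each row to the row obtained by moving its first letter to the end. The helper names are hypothetical, positions are 0-indexed, and each necklace is given by one primitive word.

```python
from functools import reduce
from math import gcd

def table(words):
    """Sorted rows of lcm-length powers of all conjugates of the given words."""
    m = reduce(lambda a, b: a * b // gcd(a, b), (len(u) for u in words), 1)
    return sorted((u[i:] + u[:i]) * (m // len(u))
                  for u in words for i in range(len(u)))

def standard_permutation(w):
    """Stable sort of positions by letter (0-indexed standard permutation)."""
    return sorted(range(len(w)), key=lambda j: w[j])

def conjugation_matches_tau(words):
    """True if row tau(i) is row i with its first letter moved to the end."""
    rows = table(words)
    w = "".join(r[-1] for r in rows)          # final column of the table
    tau = standard_permutation(w)
    return all(rows[tau[i]] == rows[i][1:] + rows[i][0]
               for i in range(len(rows)))
```

Since row `tau[i]` is row `i` shifted one place to the left, the letter in column j of row `i` appears in column j-1 of row `tau[i]`, and the first letter of row `i` is the last letter of row `tau[i]`.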
**Definition 1.2.8** *(Table of a word)* Let and let be the standard permutation of . Let us write the cycle so is least such that and let denote the lcm of the cycle lengths. Define the table to be the table, the th row of which is the unique word such that is defined.
**Proposition 1.2.9** Let and be as in Definition 1.2.8. Let be the length of and let be the corresponding prefix of , the th row of . Then
(i) is the root of ;
(ii) all conjugates of arise as roots of the rows of with multiplicity equal to that of .
(iii) The rows of are ranked lexicographically.
(iv) The final column of is .
Proof (i) By construction, and is the shortest prefix of with this property. In particular, it follows from this that . To show that is itself primitive, and so the root of , suppose to the contrary that for some . Then ; without loss suppose that . By applying to both sides of this inequality (remembering that is defined for all we infer that
[TABLE]
a contradiction. Hence and is the root of , as claimed.
(ii) Let be a conjugate of , the root of . Then
[TABLE]
and since is primitive, it follows that and is indeed the root of . This process associates each instance of the root with an instance of the conjugate in a one-to-one fashion, thereby matching the multiplicity of to that of each of its conjugates in the table.
(iii) Let , let and be distinct words that occupy the respective rows and of and let be the longest common prefix of and so that and say. Then since is order-preserving we have . Since and have common length , it follows that say with . Moreover, since dom dom and , it follows that and so , as required.
(iv) Let denote the table . Then if and only if dom . In particular, taking gives that dom , whence dom . At the same time we observe that exactly when dom and therefore for all , whence is indeed the final column of .
**Definition 1.2.10** *(Inverse Burrows-Wheeler map)* Define as follows. Given , form as in Definition 1.2.8. Let be the set of necklaces defined by the roots of the rows of . (With under .)
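The inverse map can be sketched without building the whole table: follow each cycle of the standard permutation from its least element and write down, for each index visited, the letter of the sorted word at that index. The forward transform is included only so that the round trip can be checked; names and 0-indexing are assumptions of this sketch.

```python
from functools import reduce
from math import gcd

def inverse_extended_bwt(w):
    """Multiset of necklaces of w, each given by the word read along one cycle
    of the standard permutation, starting from the least index of the cycle."""
    tau = sorted(range(len(w)), key=lambda j: w[j])   # standard permutation
    first = sorted(w)                                 # first column of the table
    seen = [False] * len(w)
    words = []
    for start in range(len(w)):
        if seen[start]:
            continue
        i, letters = start, []
        while not seen[i]:
            seen[i] = True
            letters.append(first[i])   # letter whose domain contains index i
            i = tau[i]
        words.append("".join(letters))
    return words

def extended_bwt(words):
    """Final column of the sorted table of lcm-length powers of conjugates."""
    m = reduce(lambda a, b: a * b // gcd(a, b), (len(u) for u in words), 1)
    rows = sorted((u[i:] + u[:i]) * (m // len(u))
                  for u in words for i in range(len(u)))
    return "".join(r[-1] for r in rows)
```

On the hypothetical word `bbaba` this returns `["ab", "abb"]`, and on `abab`, which is not the classical transform of any single primitive word, it returns the three necklaces `["a", "ab", "b"]`. Applying `extended_bwt` to either result gives back the word that was inverted.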
Theorem 1.2.11 [5, 8] The mapping of Definition 1.2.10 is the inverse Burrows-Wheeler transform .
*Proof* We first prove that for any , . Let be the table of and let as in Definition 1.2.3. We show that the th row of is the th row of . By Proposition 1.2.6, identifying the rows of with the chain allows us to say that . In particular the lcm of the cycle lengths of both permutations is a common value , and by Proposition 1.2.2(iv) is the lcm of the lengths of the roots of the words of so that is an array.
Now suppose that . By Proposition 1.2.6 it follows that , so that . Repeated application of this observation gives that so that is the unique word of length such that is defined. Hence say. By Definition 1.2.10, is the set of necklaces formed by the roots of , which is the set itself, and so .
Conversely, take any say and let . By Definition 1.2.10, is the collection of necklaces of the roots of the rows of . By Proposition 1.2.9(i), if is the root of row in , then is the length of the cycle of . It follows that there is a common value for the lcm of the lengths of the roots of the rows of (which is the row length of ) and the lcm of the cycle lengths of (which is the row length of ). By Proposition 1.2.9(ii), all members of appear as roots of rows of with equal multiplicity while by (iii) the rows of are ranked lexicographically. It follows from all this that is an array. Now is the final column of , which by Proposition 1.2.9(iv) is the word . We conclude that .
Remark 1.2.12 The first part of the previous proof establishes that while the third paragraph shows that so that the bijection between words and necklaces is through equality of the corresponding table . Moreover Proposition 1.2.6 shows that the action of on corresponds to that of on and Proposition 1.2.7 shows that acts to map each column of onto its predecessor modulo .
**Theorem 1.2.13** Let , and let with two words in the set of rows of . Then root if root is Lyndon. In particular the Lyndon words appear in in lexicographic order.
Proof We prove the first statement by showing that if then root is not Lyndon. Given this claim, suppose that root and root are both Lyndon words such that . Then root root so that the Lyndon roots do indeed appear in lexicographic order in .
Since with root it follows that is not primitive and so for some . Since and we may write with and . If , then say whence , contrary to hypothesis and so whence, since is a power of , for some maximal , and where is a prefix of . It follows that where so that say whence . Taking the factorization , we see that is a conjugate of . We also have the factorization , whence as , which implies that root root and so root is not Lyndon, as required.
2 Semigroup of the Burrows-Wheeler transform
Semigroup of a necklace
In [6] the author wrote about the semigroup generated by the letters acting by conjugation on the necklace of a primitive word . In particular the question of when two words and have isomorphic semigroups and was settled by Theorem 2.4 of [6]. The semigroup is exactly the semigroup generated by the partial mappings encountered above. We show here that is isomorphic to the syntactic semigroup of the cyclic semigroup generated by the word .
We begin with a fixed primitive word over the finite ordered alphabet . Consider the necklace , ordered lexicographically.
**Definition 2.1** Identify the chain with the chain . The semigroup is the subsemigroup of generated by the set of partial mappings where .
In this section it is convenient to denote the mapping by so that the semigroup is generated by the set of partial mappings ( where dom if and only if so that say in which case . We write this using action notation as , allowing us to suppress the dash to the right of the central dot without introducing ambiguity. The free monoid acts on the right of in that for all and (taking to be the identity mapping). Note that depends only on the necklace and not its representative (and so we may assume that the Lyndon word of although this is not necessary). We make use of the following facts from Proposition 1.3 in [6]; part (iii) is well-known - see for example the text [7].
**Proposition 2.2** Let and be an integer. Let be the prefix of of length so that $t=mn+s$ $(0\leq m,\ 0\leq s\leq n-1)$. Write and define by . Then
(i) and ;
(ii) is the unique word of length such that is defined.
(iii) A Lyndon word has no border.
Proof (i) and (ii) are immediate consequences of the definition of the action of each letter on a given word. As for (iii), suppose to the contrary that for some . Then since is Lyndon (and primitive) we may apply (i) to infer that and . From the first of these inequalities we get as the latter is defined because . However we then obtain , which is a contradiction. Therefore has no border.
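The semigroup of Definition 2.1 can be explored by brute force for short words. In this sketch each letter acts on the lexicographically ordered necklace by moving a leading occurrence of that letter to the end of the word; partial mappings are stored as tuples with `None` marking an undefined value, composition is from left to right, and the semigroup is obtained by closing the generators under composition. The representation and the names are assumptions made for illustration only.

```python
def necklace(u):
    """Conjugates of the primitive word u in lexicographic order."""
    return sorted({u[i:] + u[:i] for i in range(len(u))})

def letter_generators(u):
    """One partial mapping per letter: the conjugate a.v is sent to v.a."""
    neck = necklace(u)
    index = {v: i for i, v in enumerate(neck)}
    return {a: tuple(index[v[1:] + v[0]] if v[0] == a else None for v in neck)
            for a in sorted(set(u))}

def compose(f, g):
    """Apply f first and then g (composition written from left to right)."""
    return tuple(None if x is None else g[x] for x in f)

def generated_semigroup(gens):
    """All partial mappings obtainable as non-empty products of generators."""
    elems = set(gens)
    frontier = list(elems)
    while frontier:
        f = frontier.pop()
        for g in gens:
            h = compose(f, g)
            if h not in elems:
                elems.add(h)
                frontier.append(h)
    return elems
```

For the word `ab` the closure has five elements: the two generators, the two partial identities arising from the products `ab` and `ba`, and the empty mapping.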
We now introduce a second realisation of via a certain syntactic congruence, thus producing without reference to mappings. (For background on syntactic semigroups and congruences see [11].) Let be the subsemigroup of of all positive powers of . Let be the syntactic congruence on generated by so that for :
[TABLE]
**Definition 2.3 **The semigroup .
**Lemma 2.4** Let and be conjugate words. Then .
Proof Suppose that . Then for any we have that if for some then . Since this in turn implies that for some as , whence . Hence it follows that implies that . Interchanging the roles of and in this argument yields the conclusion that and by symmetry of the conjugation relation we see that the reverse inclusion also holds. Therefore and .
**Theorem 2.5** For any primitive word ,
*Proof* For each , let be the corresponding member of and be that of . We show that a required isomorphism is given by the mapping . We first verify that if and only if , thereby showing that is an injective function. It is then clear from the definition that is also surjective and is a homomorphism as for any we then have
[TABLE]
To this end suppose that and suppose further that , where and that is defined. By Proposition 2.2(ii), for some , where say . We shall show that :
[TABLE]
where the second and fourth equalities are by Proposition 2.2(i). Then since we have:
[TABLE]
and so . Therefore since we infer that for some ( as .) Hence, by cancelling on the left and on the right of this equation we obtain:
[TABLE]
Invoking (2) and then (3) we infer that
[TABLE]
Since the mapping is injective, (4) allows us to deduce that . Since were arbitrary, it follows that implies that as the argument shows that for any , if one of is defined, then both are defined and are equal.
To prove the converse we next suppose that for some and suppose further that for some and . We verify that . The following argument will hold with the roles of and reversed and so this claim yields that if then , thus establishing that is a one-to-one mapping from into . Since we obtain $(u\cdot p)\cdot x=(u\cdot p)\cdot y\Rightarrow((u\cdot p)\cdot x)\cdot q=((u\cdot p)\cdot y)\cdot q\Rightarrow u\cdot(pxq)=u\cdot(pyq)\Rightarrow u\cdot u^{m}=u\cdot(pyq)$; by Proposition 2.2(i) we infer that . By Proposition 2.2(ii), for some non-empty prefix of say. However then we obtain
[TABLE]
Hence and since is primitive it follows that , and so for some . In particular, , as required to complete the proof of the claim. Therefore is an isomorphism from to .
For a multiset of necklaces , we may define the semigroup in terms of the partial mappings of the standard permutation of .
**Theorem 2.6** Let be a multiset of necklaces and let . Let be the subsemigroup of generated by the set of mappings of . Then is a subsemigroup of isomorphic to a subdirect product of the syntactic semigroups .
Proof Let denote any member of the set of domains of disjoint cycles of . Since and each is a restriction of , it follows that is a (possibly empty) one-to-one and order-preserving mapping in , where inherits a linear order as a subchain of . The mapping whereby induces an injective homomorphism . Let denote the th projection mapping on so that . We see that is the image of (where dom ) with generators . It follows that may be regarded as an injective homomorphism of into . Finally, by Theorem 2.5, , the syntactic semigroup of and so we conclude that is isomorphic to a subdirect product of the syntactic semigroups of each of the languages , as required.
3 de Bruijn Words
In this section we take our alphabet to be , although we continue to refer to its members as *letters*. An interesting special case is where we take the BW transform of (the necklace of) a de Bruijn word of span over a finite -ary alphabet, which can be defined as a word of length for which every word of length appears exactly once as a cyclic factor of . For every and for every -ary alphabet , de Bruijn words exist and their number is [1].
**Definition 3.1** A multiset of necklaces is a *de Bruijn set of span* over if and every is a prefix of some power of some word of the necklaces .
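The defining property of a de Bruijn word is easy to test by listing cyclic factors. This is a small sketch with a hypothetical function name; the alphabet is passed as a string of distinct characters.

```python
def is_de_bruijn_word(w, n, alphabet):
    """True if w has length k**n over a k-letter alphabet and every word of
    length n occurs exactly once as a cyclic factor of w."""
    k = len(alphabet)
    if len(w) != k ** n or not set(w) <= set(alphabet):
        return False
    ext = w + (w * n)[:n - 1]          # wrap around to read cyclic factors
    factors = {ext[i:i + n] for i in range(len(w))}
    # k**n starting positions with pairwise distinct factors cover all of A^n.
    return len(factors) == k ** n
```

For example `00010111` is a binary de Bruijn word of span 3 and `0011` is one of span 2, whereas `0101` is not, since its cyclic factors of length 2 repeat.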
Remarks 3.2 The number of distinct prefixes of length of powers of the words of the necklaces is at most so, given that is a de Bruijn set of span , every word in can be read exactly once within the necklaces of . It also follows in particular that no two necklaces in are equal so that is indeed a set, as opposed to a multiset, of necklaces.
**Lemma 3.3** Let be a de Bruijn set of span . Then contains a necklace of length at least .
Proof There exist Lyndon words of length (e.g. take , where ). Let be a necklace of cardinal so that say with . Any prefix of length of a power factor of a word has a border of length if and has a border of length otherwise. Since is a Lyndon word, has no border by Proposition 2.2(iii), and so cannot arise as a prefix power of a word . Since is a prefix power of some word in some necklace of , it follows that contains a necklace of cardinal at least .
The bound of in Lemma 3.3 is tight: see Theorem 3.8 below. It follows from Lemma 3.3 that the length of the rows of the table is at least . Consider the sub-table consisting of the first columns of . Since is an span de Bruijn set, the rows of this sub-table form the dictionary of . Each is the prefix of successive rows of and if two of these rows ended with the same letter , then the images of these two rows under would both begin with , from which it would follow that would be a prefix of a power of two distinct words of the necklaces of , contrary to being a de Bruijn set of span . It follows that the final column of is a product of members (possibly with repetitions) taken from the set (that is, consists of all products of distinct members of ). These observations establish the forward implication in the following result.
Theorem 3.4 The set of all BW transforms of de Bruijn sets of span over a -letter alphabet is .
**Examples 3.5** Let . We may write so that where . Take . The standard permutation is the transitive cycle
[TABLE]
yielding the span Lyndon de Bruijn word . As a second example take so that
[TABLE]
the corresponding set of Lyndon words is , the cyclic -factors of which are all the words of with arising from the necklace defined by the Lyndon word .
We prove the reverse implication in Theorem 3.4 via two lemmas.
**Lemma 3.6** Let . Then , a union of order-preserving partial mappings with dom for . The sets ran also partition and each range set is itself a transversal of the partition of into the successive intervals of length which are:
[TABLE]
Proof The description of the sets dom follows from the fact that is the same value, , for each and the sets ran always partition the base set as is a permutation. The claim as regard transversals follows as each is a product of words from .
For any and integer there is a unique product with each , such that is defined. The product can therefore be identified with , which we shall call the *-string* of .
**Lemma 3.7** Let be an -digit -ary expression . Then for any whose -digit -ary representation has as a prefix, the -ary -string of is in the standard permutation for every . Moreover, the domain of the partial mapping is the interval of all , the -ary representation of which begins with . In particular, dom .
Proof By Lemma 3.6, is defined if and only if and so the claim holds if . We shall now verify that has the -ary form , , from which the result follows by repeated application of this fact. Now since is order-preserving, it follows from Lemma 3.6 that we may identify the interval of (5) in which lies by putting , giving:
[TABLE]
and so , as required. By what we have just proved and the uniqueness of the products , the integer dom if and only if is a prefix of , whence the final claim follows.
Proof of Theorem 3.4. The forward implication was proved in the preamble to the theorem so consider the converse. For consider . By Lemma 3.7, for any , is the unique word such that is defined. Since some members such as , are primitive, the table has at least columns. It follows that the prefix of length of the row of is and so the sub-table of the first columns of has as its rows the members of written in numerical order. In particular occurs among the factors of length that can be read from the words of the necklaces of , and so each such must occur exactly once and therefore is a de Bruijn set of span .
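Theorem 3.4 can be tested exhaustively for small parameters. The sketch below reads the theorem, following the preamble, as saying that the transforms of de Bruijn sets of span n over a k-letter alphabet are the products of k^(n-1) arrangements of the alphabet; that reading of the stripped formula is an assumption, as are the function names. Each such product is inverted via the standard permutation and the resulting necklaces are checked against Definition 3.1.

```python
from itertools import permutations, product

def inverse_extended_bwt(w):
    """Read one word along each cycle of the standard permutation of w."""
    tau = sorted(range(len(w)), key=lambda j: w[j])
    first = sorted(w)
    seen, words = [False] * len(w), []
    for start in range(len(w)):
        i, letters = start, []
        while not seen[i]:
            seen[i] = True
            letters.append(first[i])
            i = tau[i]
        if letters:
            words.append("".join(letters))
    return words

def power_prefixes(u, n):
    """Length-n prefixes of the powers of each conjugate of u."""
    return [((u[i:] + u[:i]) * (n // len(u) + 1))[:n] for i in range(len(u))]

def is_de_bruijn_set(words, n, alphabet):
    """Every word of length n is read exactly once from the necklaces."""
    seen = sorted(p for u in words for p in power_prefixes(u, n))
    return seen == sorted("".join(t) for t in product(alphabet, repeat=n))

def block_products(alphabet, n):
    """All products of k**(n-1) arrangements of a k-letter alphabet."""
    arrangements = ["".join(p) for p in permutations(alphabet)]
    for blocks in product(arrangements, repeat=len(alphabet) ** (n - 1)):
        yield "".join(blocks)
```

For instance `0110` inverts to the necklaces of `0` and `011`, from which the four binary words of length 2 can each be read once, while `1010` inverts to the single de Bruijn word `0011`.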
We next look at the special case where is a power of .
**Theorem 3.8** Let , let and let . Then the rows of are simply the list of numbers . Moreover is the set of necklaces of Lyndon words of length dividing . The Lyndon words of the roots of the necklaces of occur in the rows of in lexicographic order.
Proof As in the proof of Theorem 3.4, we see that the sub-table of the first columns of simply lists the numbers of . However since , for , is the th member of the specified interval in (5), that is to say, by repetition of this observation we infer that for any , the sequence (where ) is the cyclic sequence under of Since , it follows that the cardinal of the corresponding necklace is a divisor of ; in particular , the lcm of the length of the roots of words of the rows is , so that is simply the table of . The least member of each necklace is by definition a Lyndon word. Every Lyndon word of length dividing has a power which is some word and so occurs as a Lyndon word of some necklace in . The Lyndon roots of the words of occur in lexicographic order by Theorem 1.2.13.
**Example 3.9** Let us take , and again reverting to the alphabet , we have and . Then
[TABLE]
[TABLE]
Expressed as a concatenation of Lyndon words of the corresponding necklaces we obtain:
[TABLE]
This is indeed the first de Bruijn word of span in the lexicographic order. That this is always the case is a well-known theorem of Fredricksen and Maiorana. (See also [10] for an alternative proof.)
Theorem 3.10 [4] For a given , the lexicographic concatenation of all Lyndon words of length dividing is the de Bruijn word of span that lies first in the lexicographic order.
Corollary 3.11 Taken in ascending order of their Lyndon words, the concatenation of the Lyndon words of the necklaces of is the first de Bruijn word of span in the lexicographic order.
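Corollary 3.11 suggests a short procedure: take the alphabet written in increasing order, repeat it k^(n-1) times, invert the transform, and concatenate the resulting words in the order of their cycles. The sketch below does this with hypothetical function names; it relies on the inversion returning the cycles in order of their least index, which by Theorem 1.2.13 is the lexicographic order of the Lyndon words.

```python
def inverse_extended_bwt(w):
    """Read one word along each cycle of the standard permutation of w,
    taking the cycles in order of their least index."""
    tau = sorted(range(len(w)), key=lambda j: w[j])
    first = sorted(w)
    seen, words = [False] * len(w), []
    for start in range(len(w)):
        i, letters = start, []
        while not seen[i]:
            seen[i] = True
            letters.append(first[i])
            i = tau[i]
        if letters:
            words.append("".join(letters))
    return words

def lyndon_factors(alphabet, n):
    """Lyndon words obtained by inverting the alphabet repeated k**(n-1) times."""
    letters = "".join(sorted(alphabet))
    w = letters * (len(letters) ** (n - 1))
    return inverse_extended_bwt(w)

def least_de_bruijn_word(alphabet, n):
    """Concatenation of the Lyndon words returned by lyndon_factors."""
    return "".join(lyndon_factors(alphabet, n))
```

For the binary alphabet and span 3 the Lyndon words are `0, 001, 011, 1`, giving `00010111`; for span 4 the result is `0000100110101111`, and for the ternary alphabet with span 2 it is `001021122`.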
**Example 3.12** Let us take so that say and calculate where We find that
[TABLE]
[TABLE]
and so the least de Bruijn word that contains all words of length over the alphabet as its set of cyclic factors is the following concatenation of Lyndon words of lengths 1 or 3 over :
[TABLE]
4 Maximum number of distinct factors of a word
As an application of de Bruijn words we derive the functional form for the maximum number of distinct factors in of a word of length over a fixed finite alphabet . The upper bound in our result comes from observing that long words must have repeated short factors while the proof for the lower bound relies on the fact that factors of de Bruijn words have no repeats of their long factors. The topic of the number of subwords of a word has been extensively investigated: for example see Section 6.3 of [7].
Consider the finite alphabet . The set $A^{\leq m}=\{w:w\in A^{+}\mbox{ and }|w|\leq m\}$. The number of distinct factors of will be denoted by .
Lemma 4.1 With repeats, the number of factors in of is .
Proof A factor of is determined by the choice of two distinct positions with each position occurring either between letters or at either end of . There are such pairs.
Corollary 4.2 For we have . Moreover, the lower bound is obtained if and only if and the upper bound is attained if and only if .
Proof The upper bound for comes from Lemma 4.1. Since any word has distinct prefixes it follows that always holds. If , then for some and the set of factors of is and is of cardinal . On the other hand if then, in addition to its prefixes, also has the factor where so that . Next suppose that . Put ; no two factors of have the same content so the factors of are pairwise distinct, showing that the upper bound in the statement is attained in this case. For all remaining cases we have in which instance has two identical -factors and so .
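Lemma 4.1 and Corollary 4.2 can be checked by direct enumeration on short words. In this sketch a factor occurrence is a pair of cut positions, and the empty word is not counted as a factor; the function names are hypothetical.

```python
from itertools import product

def factor_occurrences(w):
    """Non-empty factors of w listed with repeats, one per pair of cuts i < j."""
    n = len(w)
    return [w[i:j] for i in range(n) for j in range(i + 1, n + 1)]

def distinct_factor_count(w):
    """Number of distinct non-empty factors of w."""
    return len(set(factor_occurrences(w)))

def words(alphabet, n):
    """All words of length n over the alphabet."""
    return ["".join(t) for t in product(alphabet, repeat=n)]
```

A word of length n has n(n+1)/2 factor occurrences. Among words of length n, the count of distinct factors equals n exactly for powers of a single letter and equals n(n+1)/2 exactly when all letters of the word are distinct; for example `abab` has 7 distinct factors, strictly between 4 and 10.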
In light of Corollary 4.2 we shall henceforth assume that .
Definition 4.3 Let max.
Theorem 4.4 .
Proof For , a word has (not necessarily distinct) -factors and . Hence there are at least repeated -factors in . Let be the greatest value of such that , noting that . The total number of repeated factors in is then at least:
[TABLE]
Now since we have ; by taking logarithms to the base we obtain so that Moreover, whence it follows that
[TABLE]
Conversely, given , let be determined by the inequalities . Take to be a factor of a de Bruijn word of span over . For any positive integer there are factors of length in . Moreover if , these factors are pairwise distinct as the members of the set of prefixes of length of these factors are pairwise distinct since is a de Bruijn word of index . Hence
[TABLE]
[TABLE]
Now we have , whence and so (8) yields:
[TABLE]
Combining (7) and (9) we conclude that
Acknowledgement I would like to thank the referees for their constructive suggestions and Alexei Vernitski for pointing out the connection between the standard permutation and syntactic semigroups.
[1] van Aardenne-Ehrenfest T. and de Bruijn N.G., Circuits and trees in oriented linear graphs, Simon Stevin 28 (1951), 203-217.
[2] Burrows M. and Wheeler D.J., A block sorting data compression algorithm, Technical Report, DIGITAL System Center, (1994).
[3] Crochemore M.J., Désarménien J. and Perrin D., A note on the Burrows-Wheeler transformation, Theoretical Computer Science, Vol. 332, Issue 1-3 (2005), 567-572.
[4] Fredricksen H. and Maiorana J., Necklaces of beads in k colors and k-ary de Bruijn sequences, Discrete Math. 23 (1978), 207-210.
[5] Gessel I.M. and Reutenauer C., Counting permutations with given cycle structure and descent set, J. Combin. Theory Series A, 64 (1993), 189-215.
[6] Higgins P.M., The semigroup of conjugates of a word, International Journal of Algebra and Computation, Vol. 16, No. 6 (2006), 1015-1029.
[7] Lothaire M., ‘Combinatorics on Words’, Cambridge University Press, (2002).
[8] Mantaci S., Restivo A., Rosone G. and Sciortino M., An extension of the Burrows-Wheeler Transform and applications to sequence comparison and data compression, in ‘Combinatorial Pattern Matching’, Lecture Notes in Computer Science, (2005), Vol. 3537/2005, 178-189.
