Tower-type bounds for unavoidable patterns in words
David Conlon, Jacob Fox, Benny Sudakov

TL;DR
This paper investigates the bounds on the length of words over finite alphabets that guarantee the presence of unavoidable patterns, specifically Zimin patterns, establishing tight tower-type bounds for these functions.
Contribution
It provides essentially tight tower-type bounds for the function determining the minimal word length containing Zimin patterns, advancing the quantitative understanding of unavoidable patterns in words.
Findings
Established tight bounds for the function f(n,q) for Zimin patterns.
Determined f(3,q) up to a constant factor as Θ(2^q q!).
Extended the understanding of unavoidable patterns in combinatorics on words.
Abstract
A word $w$ is said to contain the pattern $P$ if there is a way to substitute a nonempty word for each letter in $P$ so that the resulting word is a subword of $w$. Bean, Ehrenfeucht and McNulty and, independently, Zimin characterised the patterns $P$ which are unavoidable, in the sense that any sufficiently long word over a fixed alphabet contains $P$. Zimin's characterisation says that a pattern is unavoidable if and only if it is contained in a Zimin word, where the Zimin words are defined by $Z_1 = x_1$ and $Z_n = Z_{n-1} x_n Z_{n-1}$. We study the quantitative aspects of this theorem, obtaining essentially tight tower-type bounds for the function $f(n,q)$, the least integer such that any word of length $f(n,q)$ over an alphabet of size $q$ contains $Z_n$. When $n = 3$, the first non-trivial case, we determine $f(n,q)$ up to a constant factor, showing that $f(3,q) = \Theta(2^q q!)$.
David Conlon Mathematical Institute, Oxford OX2 6GG, United Kingdom. Email: [email protected]. Research supported by a Royal Society University Research Fellowship and by ERC Starting Grant 676632.
Jacob Fox Department of Mathematics, Stanford University, Stanford, CA 94305, USA. Email: [email protected]. Research supported by a Packard Fellowship, by NSF Career Award DMS-1352121 and by an Alfred P. Sloan Fellowship.
Benny Sudakov Department of Mathematics, ETH, 8092 Zurich, Switzerland. Email: [email protected]. Research supported in part by SNSF grant 200021-175573.
1 Introduction
The term Ramsey theory refers to a broad range of deep results from various mathematical areas, like combinatorics, logic, geometry, ergodic theory, number theory and analysis, all connected by the fact that large systems contain unavoidable patterns. Examples of such results include Ramsey’s theorem in graph theory, Szemerédi’s theorem in number theory, Dvoretzky’s theorem in asymptotic functional analysis and much more.
In this paper, we study the appearance of such unavoidable patterns in words, where words and patterns are here defined to be strings of characters from distinct fixed alphabets. We say that a word $w$ contains the pattern $P$ if there is a way to substitute nonempty words, which need not be disjoint or even distinct, for the letters in $P$ so that the resulting word is a subword of $w$, where a subword of $w$ is defined to be a string of consecutive letters from $w$. Conversely, we say that $w$ avoids $P$ if $w$ does not contain $P$.
For example, it is a simple exercise to show that every four-letter word over a two-letter alphabet contains the pattern $xx$, while Thue [13, 14] famously constructed an infinite word over a three-letter alphabet avoiding $xx$. This example alone has a surprisingly rich history [1, 4], being related, among other things, to work of Morse [10] on symbolic dynamics.
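The containment definition and the four-letter claim above can be checked by brute force. The following is a minimal sketch, not an efficient algorithm; the function names `matches` and `contains` are hypothetical, and patterns and words are both plain Python strings.

```python
from itertools import product

def matches(pattern, word, assignment=None):
    """Return True if `word` is exactly an image of `pattern` under some
    substitution of nonempty words for the pattern's letters."""
    if assignment is None:
        assignment = {}
    if not pattern:
        return not word
    letter, rest = pattern[0], pattern[1:]
    if letter in assignment:
        image = assignment[letter]
        return word.startswith(image) and matches(rest, word[len(image):], assignment)
    # Try every nonempty prefix as the image of this letter, leaving at
    # least one character for each remaining pattern letter.
    for k in range(1, len(word) - len(rest) + 1):
        assignment[letter] = word[:k]
        found = matches(rest, word[k:], assignment)
        del assignment[letter]
        if found:
            return True
    return False

def contains(word, pattern):
    """Return True if some subword (block of consecutive letters) of `word`
    is an image of `pattern`."""
    n = len(word)
    return any(matches(pattern, word[i:j])
               for i in range(n) for j in range(i + 1, n + 1))

# Every four-letter word over a two-letter alphabet contains the pattern xx.
all_binary_contain_square = all(
    contains("".join(w), "xx") for w in product("ab", repeat=4)
)
```

Here `all_binary_contain_square` evaluates the exercise directly, while a short ternary word such as `abcacb` is reported as avoiding `xx`.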
For a positive integer $q$, we say that the pattern $P$ is $q$-unavoidable if every sufficiently long word over a $q$-letter alphabet contains a copy of $P$. In the example above, where $P = xx$, $P$ is $2$-unavoidable, but $3$-avoidable. We say that the pattern $P$ is unavoidable if it is $q$-unavoidable for all $q \geq 1$. The unavoidable patterns were characterised by Bean, Ehrenfeucht and McNulty [3] and, independently, by Zimin [15]. Zimin’s characterisation, which is particularly appropriate for our purposes, says that a pattern is unavoidable if and only if it is contained in a Zimin word.
The Zimin words are defined recursively: $Z_1 = a$, $Z_2 = aba$, $Z_3 = abacaba$ and, in general, $Z_n = Z_{n-1} x Z_{n-1}$, where $x$ is a new letter. As well as playing a central role in the study of unavoidable patterns in words, these words are important in the study of Burnside-type problems, showing up in Ol’shanskii’s proof of the Novikov–Adian theorem and, in a slightly different guise, in Zelmanov’s work on the restricted Burnside problem (see [12] for a thorough discussion).
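The recursion for Zimin words is short enough to write out. This sketch assumes the standard definition (start with one letter, then sandwich a new letter between two copies of the previous word) and uses lowercase ASCII letters as the pattern alphabet, which is a choice made here for illustration.

```python
from string import ascii_lowercase

def zimin(n):
    """Return the n-th Zimin word over the letters a, b, c, ..."""
    if not 1 <= n <= len(ascii_lowercase):
        raise ValueError("n must be between 1 and 26 in this sketch")
    word = ascii_lowercase[0]
    for i in range(1, n):
        # Z_{i+1} = Z_i, then a new letter, then Z_i again.
        word = word + ascii_lowercase[i] + word
    return word
```

Each step doubles the length and adds one, so the n-th word has length 2^n - 1.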
It is natural and interesting to consider the quantitative aspects of Zimin’s theorem. Following Cooper and Rorabaugh [8], we let $f(n,q)$ denote the smallest integer such that every word of length $f(n,q)$ over an alphabet of size $q$ contains a copy of $Z_n$. It is a simple exercise to verify that $f(1,q) = 1$ and $f(2,q) = 2q+1$. For general $n$, Zimin’s work gives an Ackermann-type upper bound for $f(n,q)$. However, a combination of recent results due to Cooper and Rorabaugh [8] and Rytter and Shur [11] gives the considerably better bound that, for and ,
[TABLE]
where the term in the topmost exponent does not depend on (in fact, it can be taken to be zero when is sufficiently large).
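For very small parameters the threshold function can be computed by exhaustive search, which confirms the values 1 (first Zimin word) and 2q+1 (second Zimin word) for small q. The sketch below relies on the observation that a word is an instance of the n-th Zimin word exactly when it has the form u v u with u an instance of the (n-1)-th Zimin word and v nonempty. The names `is_zimin_instance`, `contains_zimin`, `threshold` and the cutoff `max_len` are hypothetical, and the search is only feasible for tiny cases.

```python
from itertools import product

def is_zimin_instance(s, n):
    """True if s is an image of the n-th Zimin word under a nonempty substitution."""
    if n == 1:
        return len(s) >= 1
    # s = u v u with |u| = k >= 1 and |v| >= 1, so 2k + 1 <= len(s).
    for k in range(1, (len(s) - 1) // 2 + 1):
        if s[:k] == s[len(s) - k:] and is_zimin_instance(s[:k], n - 1):
            return True
    return False

def contains_zimin(word, n):
    """True if some block of consecutive letters of word is an instance of Z_n."""
    length = len(word)
    return any(is_zimin_instance(word[i:j], n)
               for i in range(length) for j in range(i + 1, length + 1))

def threshold(n, q, max_len=12):
    """Smallest length L <= max_len such that every word of length L over
    q letters contains Z_n, or None if no such L is found."""
    for length in range(1, max_len + 1):
        if all(contains_zimin(w, n) for w in product(range(q), repeat=length)):
            return length
    return None
```

For example, `threshold(2, 3)` searches all ternary words up to length 7; the n = 3 case is already out of reach for this naive approach.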
Our first result is a lower bound matching the upper bound when is sufficiently large in terms of .
Theorem 1.1
For any fixed ,
[TABLE]
In particular, for , this says that , a result we will prove by an appeal to the Lovász local lemma. A key observation here is that it is not enough to apply the local lemma to the uniform random model where every word of a given length occurs with the same probability (though an approach of this form is discussed in [8]). Instead, we make use of a non-uniform random model which separates all instances of any given letter.
For higher , there are two different ways to proceed, one based on generalising the local lemma argument discussed above and another based on an explicit iterative construction which allows us to step up from the -case to the -case for all . This is in some ways analogous to the situation for hypergraph Ramsey numbers, where the Ramsey numbers of complete -uniform hypergraphs determine the Ramsey numbers of complete -uniform hypergraphs for all . The difference here is that we are able to determine very accurately, while the Ramsey number of the complete -uniform hypergraph remains as elusive as ever (see [7] for a thorough discussion).
This stepping-up method also allows us to address the weakness in Theorem 1.1, that is taken to be fixed. Indeed, after suitable modification, the method proves sufficiently malleable that we can prove a tower-type lower bound even over a binary alphabet. This is the content of the next theorem, which is clearly tight up to an additive constant in the tower height.
Theorem 1.2
[TABLE]
We also look more closely at the case. This has been studied in some depth before, with Rytter and Shur [11] proving that . We improve their result by a factor of roughly and show that this is tight up to a multiplicative constant.
Theorem 1.3
$f(3,q) = \Theta(2^q q!)$.
The paper is laid out as follows. For completeness, we will describe the simple proof of the upper bound on in the next section. In Section 3, we will show how the local lemma can be used to prove Theorem 1.1. We do this in two stages, first proving a lower bound for which is sufficient for iteration and then addressing the general case. In Section 4, we discuss the stepping-up technique, first showing how to complete the second proof of Theorem 1.1 via this method and then how to modify the approach to give Theorem 1.2. In Section 5, we prove Theorem 1.3, determining up to a constant factor. We conclude by discussing some further directions and open problems. Throughout the paper, we will use to denote the logarithm base . For the sake of clarity of presentation, we will also systematically omit floor and ceiling signs.
2 The upper bound
The proof of the upper bound has two components. The first is the following simple lemma, due to Cooper and Rorabaugh [8].
Lemma 2.1
.
Proof: Consider a word of length of the form
[TABLE]
That is, we have words of length , each separated by an additional letter. By the definition of , each such word contains a copy of . Since there are such copies, two of them must be equal. As these two copies are separated by at least one letter, this yields a copy of .
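The length of the word used in this pigeonhole argument can be tallied explicitly. The sketch below is one reading of the proof and rests on an assumption: the number of blocks is taken to be one more than the number of distinct words of the block length, so that two blocks must coincide. The function name `step_up_bound` and the parameter `f_n` (the threshold for the previous Zimin word) are hypothetical.

```python
def step_up_bound(f_n, q):
    """Length of a word made of q**f_n + 1 blocks of length f_n, with one
    extra letter between consecutive blocks (assumed block count)."""
    blocks = q ** f_n + 1      # one more block than there are words of length f_n
    separators = blocks - 1    # a single letter between consecutive blocks
    return blocks * f_n + separators
```

Under this assumption the total simplifies to (f_n + 1)(q^{f_n} + 1) - 1; starting from the value 5 for the second Zimin word over a binary alphabet, it gives 197 for the third.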
A naive application of Lemma 2.1 starting from already yields a bound of the form
[TABLE]
To improve the topmost exponent, we use the following refinement of Lemma 2.1, due to Rytter and Shur [11]. The method works for all , but for our purposes it will suffice to consider the case .
Lemma 2.2
.
Proof: Say that a word is -minimal if it contains but every subword avoids . If is -minimal, it is easy to check that either for a fixed letter or , where all of the are distinct and for all . Thus, the number of -minimal words over an alphabet of size is
[TABLE]
Now consider a word of length of the form
[TABLE]
Each word of length contains a -minimal word. Therefore, since there are words of length and only -minimal words, two of the corresponding -minimal words must be the same. This easily yields a copy of . Since
[TABLE]
the result follows.
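The structure of the minimal words in this proof can be checked by enumeration for small alphabets. The sketch below assumes that "minimal" means the word contains the second Zimin word while no proper subword does, and it compares a brute-force count with a closed form reconstructed here under that assumption: either a letter repeated three times, or a letter a, then k distinct other letters each written once or twice, then a again. All function names are hypothetical.

```python
from itertools import product
from math import factorial

def contains_z2(word):
    """A word contains the pattern xyx iff some letter occurs at two
    positions that are at least two apart."""
    return any(word[i] == word[j]
               for i in range(len(word)) for j in range(i + 2, len(word)))

def is_z2_minimal(word):
    """Contains xyx, but neither maximal proper subword does (which covers
    every proper subword)."""
    return (contains_z2(word)
            and not contains_z2(word[1:])
            and not contains_z2(word[:-1]))

def count_minimal_brute(q):
    # Words longer than 2q + 1 have a proper subword of length 2q + 1,
    # which already contains xyx, so they cannot be minimal.
    return sum(is_z2_minimal(w)
               for length in range(3, 2 * q + 2)
               for w in product(range(q), repeat=length))

def count_minimal_formula(q):
    total = q  # the words aaa, one for each letter a
    for k in range(1, q):
        # choose a, then an ordered selection of k of the other q - 1
        # letters, each appearing once or twice
        ordered = factorial(q - 1) // factorial(q - 1 - k)
        total += q * ordered * 2 ** k
    return total
```

For q = 2 both counts give 6, namely aaa, bbb, aba, bab, abba and baab.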
The interested reader may wish to skip to Section 5, where we improve the estimate above to $f(3,q) = O(2^q q!)$ and show that this is tight up to a constant factor. For now, we continue to focus on the general case, combining Lemmas 2.1 and 2.2 to prove the required upper bound on $f(n,q)$.
Theorem 2.1
For and ,
[TABLE]
Proof: We will prove by induction on the stronger result that
[TABLE]
For the base case , the result follows from Lemma 2.2 since for . Writing , we will assume that , for some , and show that , from which the required result follows. By Lemma 2.1, we have
[TABLE]
and, therefore,
[TABLE]
as required.
3 Applying the local lemma
As an illustration of the main idea behind our proof, we will initially focus on the case , showing that . In order to state the version of the Lovász local lemma that we will need (see, for example, [2]), we say that a directed graph with is a dependency digraph for the set of events if for each , , the event is mutually independent of all the events .
Lemma 3.1
Suppose that is a dependency digraph for the events with all outdegrees at most . If for all and , then
[TABLE]
Theorem 3.1
.
Proof: We begin by splitting our alphabet arbitrarily into parts , each of size . We generate a random word by placing letters in a series of successive intervals , each of length , as follows: first, fill with a random permutation of the letters from ; then apply the same process in for each , that is, fill with a permutation of the letters from ; for interval we reuse the letters from , for interval we reuse the letters from and so on, where for the interval we reuse the letters from .
Note that, because of how we place the letters, for any two instances of the same letter, there are at least consecutive intervals of length between them. That is, every copy of has length at least and includes consecutive intervals . Therefore, in order to find a copy of in a word of this form, we must find two disjoint equal intervals of length consisting of intervals, each with the same permutations of length . We will now use the local lemma to show that there is a word of length containing no such pair and, thus, containing no copy of .
Suppose, therefore, that we have used the process described above to generate a random word of length . Let be the collection of events corresponding to the existence of two disjoint intervals of length , each consisting of of the intervals of length described above, containing the same subword. Note that any such pair of intervals of length will overlap with at most other such pairs of intervals of length . Indeed, there are at most ways to choose an interval of length overlapping with one of the two intervals forming the pair. For the other interval there are at most possibilities, each given by the first interval of length it contains.
Note that for each . Applying the local lemma, Lemma 3.1, with and , we see that since , there exists a word of length such that none of the events hold, as required. By the discussion above, this word contains no copy of , so the proof is complete.
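The non-uniform random model in this proof can be simulated directly. The sketch below is illustrative only: the number of parts and their size are free parameters here (the specific values used in the proof are not reproduced), the parts are reused cyclically, and the names `random_block_word` and `min_same_letter_gap` are hypothetical. It checks the separation property that the argument relies on.

```python
import random

def random_block_word(parts, num_intervals, rng):
    """Fill successive intervals with independent random permutations,
    cycling through the parts of the alphabet in order."""
    word = []
    for t in range(num_intervals):
        block = list(parts[t % len(parts)])
        rng.shuffle(block)  # uniformly random permutation of this part
        word.extend(block)
    return word

def min_same_letter_gap(word):
    """Smallest distance between two occurrences of the same letter
    (None if no letter repeats)."""
    last_seen = {}
    best = None
    for pos, letter in enumerate(word):
        if letter in last_seen:
            gap = pos - last_seen[letter]
            best = gap if best is None else min(best, gap)
        last_seen[letter] = pos
    return best

# Example with k = 4 parts of size m = 3 (values chosen for illustration).
k, m = 4, 3
parts = [list(range(j * m, (j + 1) * m)) for j in range(k)]
sample = random_block_word(parts, num_intervals=12, rng=random.Random(0))
```

With k parts of size m, two occurrences of a letter are separated by at least k - 1 full intervals, so the gap returned for `sample` is at least (k - 1) m + 1 regardless of the random choices.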
We also note a slight strengthening of this result which will be useful in the next section. In the proof, we will freely use notation from the proof above.
Theorem 3.2
There are at least words of length over an alphabet of size such that avoids and there is a distinguished letter such that any subword of not containing the letter avoids .
Proof: The proof is almost exactly the same as the proof of Theorem 3.1, except we set aside the distinguished letter at the start, only using it immediately after each interval of the form to separate it from the interval . By construction, the word between any two successive instances of will consist of the intervals . But the union of these intervals contains no repeated letters and, hence, no copy of , as required.
To count the number of words, note that the number of possible -letter words generated by our random process is equal to , each occurring with the same probability. Since there are fewer than bad events , each of which is independent of all but of the others, the local lemma, Lemma 3.1, tells us that with probability at least none of these bad events happen, so the process generates an appropriate word. In fact, there must be at least
[TABLE]
appropriate words, completing the proof.
The remainder of this section will be concerned with generalising the proof of Theorem 3.1 to give a local lemma proof of Theorem 1.1. The reader who is willing to accept our word that such a generalisation is possible may skip to the start of the next section to see how a recursive procedure may also be used to finish the job. For the resolute, we state a more general form of the Lovász local lemma (see [2]).
Lemma 3.2
Suppose that is a dependency digraph for the events . If there are real numbers such that and for all , then
[TABLE]
First proof of Theorem 1.1: We will generate random words in the same manner as in the proof of Theorem 3.1. That is, we split our alphabet arbitrarily into parts , each of size , and generate a random word by placing letters in a series of successive intervals , each of length , as follows: first, fill with a random permutation of the letters from ; then apply the same process in for each , that is, fill with a permutation of the letters from ; for interval we reuse the letters from , for interval we reuse the letters from and so on, where for the interval we reuse the letters from . Once again, we note that any two instances of the same letter must be at least a distance apart, and so the shortest copy of has length at least .
Define and, for , . For each , we consider all bad events corresponding to the existence of two disjoint identical intervals of length appearing at distance at most from one another. If none of these events occur in a word generated as described above, we see, since every copy of in has length at least and any two identical intervals of length are at least apart, that every copy of in has length at least . In turn, since any two identical intervals of length are at least apart, this implies that every copy of in has length at least . Iterating, we see that every copy of in must have length at least . Hence, for to contain , it must have length at least , which is easily seen to satisfy the inequality
[TABLE]
It therefore remains to show that there exists an appropriate of length such that none of the bad events for occur. To apply the local lemma, we need to analyse the dependencies between different events. Suppose, therefore, that and are fixed and we wish to determine how many of the events a particular depends on.
For , there are at most events that depend on . Indeed, one of the elements in the pair of intervals corresponding to must be equal to one of the endpoints from the pair of intervals corresponding to . There are choices for the endpoint and choices for which of the elements corresponds to this endpoint. Once these choices are made, they fix one of the intervals in the pair corresponding to and the other interval may be chosen arbitrarily within distance from the first one, so there are choices. A similar argument applies when to show that there are at most events that depend on .
To estimate , note that any interval of length will fully contain at least successive intervals of the form and, therefore,
[TABLE]
We will now apply the local lemma with for all events . By using that is fixed together with the inequality for , we see that
[TABLE]
and, therefore,
[TABLE]
We may therefore apply the local lemma to obtain the desired word, completing the proof.
4 Stepping up
We will begin this section by completing our second proof of Theorem 1.1. This is based on a simple recursion encapsulated in Lemma 4.1 below. To state this result, we need a few definitions.
Let denote the number of words over an alphabet of size which avoid . Note the inequality , which follows since the number of words over a -letter alphabet of length less than is . Let denote the set of all words over an alphabet of size which avoid and have a distinguished letter, say , such that any subword of not containing the letter avoids . We let denote the length of the longest word in and . By definition, and .
Lemma 4.1
[TABLE]
and
[TABLE]
Proof: Let denote the distinguished letter in the words in . Writing , consider any one of the orderings of the words in , say . For odd, let be obtained from by changing every to . For even, let be obtained from by changing every to . Add a new distinguished letter and consider the word formed by placing a between each and and concatenating the sequence. The number of letters in is , consisting of the original nondistinguished letters, the new letters replacing the old letter and the new distinguished letter . The number of possible choices for is , one for each ordering of the words in . Moreover, the length of is . It will therefore suffice to show that .
Note that any subword of which does not contain the distinguished letter is a subword of some and, since is a copy of a word in , it does not contain . It only remains to show that does not contain a copy of . Suppose for contradiction that it does and let this subword be , with a copy of . Neither contains two or more copies of the letter , since between any two consecutive copies of there is a unique word which cannot then appear in both copies of . If contains no , then each of the two copies of is a subword of a (not necessarily the same). However, no contains , contradicting the fact that is a copy of . So each contains exactly one . Write with a copy of . As contains exactly one , this copy of must be in and each copy of is entirely contained in a subword of of the form for some (which will be a different for the left and right copy of ). As and have different parity, the distinguished letter of is not in and the distinguished letter of is not in . Thus, the left and right copies of do not contain the distinguished letters of or of . However, and are both in , so these copies of cannot contain a copy of , contradicting the fact that is a copy of .
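The concatenation step in this proof can be sketched on a toy input. The code below assumes words are tuples of symbols, uses hypothetical symbol names (`d` for the old distinguished letter, `d0` and `d1` for its two replacements, `D` for the new distinguished letter), and takes as input a small family of distinct words rather than the full set used in the lemma. The toy family consists of all words of length at most 4 over {x, d} that avoid the third Zimin word and whose d-free subwords avoid the second.

```python
from itertools import product

def step_up(words, d="d", d0="d0", d1="d1", new_d="D"):
    """Concatenate the words, renaming d to d0 in odd-numbered words and to
    d1 in even-numbered words, with new_d between consecutive words."""
    out = []
    for i, w in enumerate(words, start=1):
        if i > 1:
            out.append(new_d)
        replacement = d0 if i % 2 == 1 else d1
        out.extend(replacement if c == d else c for c in w)
    return tuple(out)

def is_zimin_instance(s, n):
    """True if s = u v u with u an instance of Z_{n-1} and v nonempty."""
    if n == 1:
        return len(s) >= 1
    return any(s[:k] == s[len(s) - k:] and is_zimin_instance(s[:k], n - 1)
               for k in range(1, (len(s) - 1) // 2 + 1))

def contains_zimin(word, n):
    size = len(word)
    return any(is_zimin_instance(word[i:j], n)
               for i in range(size) for j in range(i + 1, size + 1))

def in_base_family(w, d="d"):
    """w avoids Z_3 and every subword without d avoids Z_2."""
    if contains_zimin(w, 3):
        return False
    size = len(w)
    return not any(d not in w[i:j] and contains_zimin(w[i:j], 2)
                   for i in range(size) for j in range(i + 1, size + 1))

base = [w for length in range(1, 5)
        for w in product("xd", repeat=length) if in_base_family(w)]
stepped = step_up(base)
```

On this input `stepped` uses the letters x, d0, d1 and D, and a direct check confirms that it avoids the fourth Zimin word, in line with the argument above.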
We may now complete our second proof of Theorem 1.1.
Second proof of Theorem 1.1: We will begin by proving inductively that
[TABLE]
for all . For , this follows from Theorem 3.2. For the induction step, we use Lemma 4.1 to conclude that
[TABLE]
which easily implies the required result. To complete the proof of the theorem, note that Theorem 3.1 handles the case , while, for , Lemma 4.1 and our bound on together imply that
[TABLE]
as required.
We now turn to the proof of Theorem 1.2. This is similar in broad outline to the proof of Theorem 1.1 above, where we produced words which are -free by concatenating a collection of -free words, separating them by instances of an extra distinguished letter. However, here, in order to avoid adding extra letters to our alphabet, we will instead separate our -free words with long strings of s. This alteration makes the proof considerably more delicate.
To proceed, we let denote the word consisting of ones and the largest set of binary words of the same length with the following properties:
1. begins and ends with a zero.
2. does not contain as a subword.
3. Any subword of not containing is -free.
4. is -free.
5. Let be obtained from by adding a one to each copy of in . Then is -free.
The key to proving Theorem 1.2 is the following lemma relating to .
Lemma 4.2
For ,
[TABLE]
Proof: Let be a permutation of the words in . Let if is odd and otherwise is obtained from by adding a one to each copy of in . The proof will follow similar lines to the proof of Lemma 4.1, but with the word serving as the analogue of a distinguished letter. That is, instead of introducing new letters, we use a special subword consisting only of ones.
To that end, let be the word formed by placing a copy of between each and and concatenating the sequence. To prove the lemma, it will suffice to show that satisfies the five properties required for a word to be in . As each , each begins and ends with a zero, and so also begins and ends with a zero, verifying the first property. As every subword of consisting only of ones has length at most and, since each begins and ends with a zero, there is a zero before and after each occurrence, does not contain , verifying the second property.
The third property asks that any subword of not containing is -free. Any such subword must be contained in for some but not containing the first or last letter. Recall also that starts and ends with a zero. By using the fourth and fifth properties of , we see that the word is -free when is odd and the word is -free when is even. Therefore, any copy of must start at the second letter of or end at the second to last letter of for some odd . Write with a copy of . As there are four copies of but at most two possible copies of in , and begins at the second letter or ends at the second to last letter of , must be all ones and have length at most . However, is a copy of and hence has length at least (since ), a contradiction. This verifies the third property.
We next verify the fourth property, that is -free. Suppose, for contradiction, that contains a copy of and this copy is of the form , where is a copy of . Neither copy of contains one of the subwords used to make , as otherwise the other copy of would have to contain an identical subword , but and are distinct. Hence, the left copy of must be in or for some and the right copy of must be in or for some . Write with a copy of . We will assume, without loss of generality, that the left copy of is in with odd (the other case may be handled similarly).
We first show that contains the copy of . If is in or in , then the fact that is a copy of would contradict the fourth and fifth properties of , respectively. If is in and contains the last letter of , then either the right copy of is a subword of , contradicting the fact that is a copy of (which must have length at least ), or the right copy of contains , forcing the left copy of to also contain but be a subword of , which is -free. In any case, we see that contains the copy of .
If does not intersect the copy of , then is entirely contained in one of the copies of and hence also in the other copy of , which is entirely in or , a contradiction. Hence, intersects the copy of . Thus, the left copy of is in and the right copy of is in . We now split into cases.
Case 1: contains as a subword.
In this case, as does not contain a copy of and ends with a zero, the left copy of is in and ends at the last letter. Writing with a copy of , we see that the right copy of contains at most letters, as otherwise the left copy of , which is a subword of , would contain . But has length at least , contradicting the fact that is a copy of .
Case 2: contains as a subword but does not contain .
In this case, by the construction of , the right copy of must be a subword of with a subword of which begins and ends with a zero and is -free. Writing with a copy of , we see, since , that contains as a strict subword. But if, for example, contains , this easily contradicts the fact that is a subword of with at most two copies of .
Case 3: does not contain as a subword.
In this case, considering the left copy of , by the third property of , is -free, contradicting that is a copy of . This completes the verification of the fourth property of .
Finally, we need to verify the fifth property. This says that if is obtained from by adding a one to each copy of in , then is -free. The proof of this is almost identical to the proof of the fourth property, but we include it for completeness.
Suppose, for contradiction, that there is a copy of in and this copy is of the form , where is a copy of . Note that while creating we did not change any of the words , since they start and end with a zero and contain no copy of by the second property of . Neither copy of contains a used to make , as otherwise the other copy of would have to contain an identical subword , but and are distinct. Hence, the left copy of must be in or for some and the right copy of must be in or for some . Write with a copy of . We will assume, without loss of generality, that the left copy of is in with odd (the other case may be handled similarly).
We first show that contains the copy of . If is a subword of , then the fact that is a copy of contradicts the fourth property of . If is a subword of , then the fact that is a copy of contradicts the fifth property of . If is in and contains one of the last two letters, then either the right copy of is a subword of , contradicting the fact that is a copy of (which must have length at least ), or the right copy of contains , forcing the left copy of to also contain but be a subword of , which is -free. If is in and contains the first letter of , then either the left copy of is a subword of , again a contradiction, or the left copy of contains , forcing the right copy of to also contain but be a subword of , which is -free. In any case, we see that contains the copy of .
If does not intersect the copy of , then is entirely contained in one of the copies of and hence also in the other copy of , which is entirely in or , a contradiction. Hence, intersects the copy of . Thus, the left copy of is in and the right copy of is in . We again split into cases.
Case 1: contains as a subword.
In this case, as does not contain a copy of , the left copy of is in and ends at one of the last two letters. Writing with a copy of , we see that the right copy of contains at most letters, as otherwise the left copy of , which is a subword of , would contain . But has length at least , contradicting the fact that is a copy of .
Case 2: contains as a subword but does not contain .
In this case, by the construction of , the right copy of must be a subword of with a subword of which begins and ends with a zero and is -free. Writing with a copy of , we see, since , that contains as a strict subword. But if, for example, contains , this easily contradicts the fact that is a subword of with at most two copies of .
Case 3: does not contain as a subword.
In this case, considering the left copy of , by the third property of , is -free, contradicting that is a copy of . We have therefore verified the fifth property of , completing the proof of the lemma.
We round off the section by proving Theorem 1.2, which states that there are binary words avoiding of length at least a tower of twos of height , that is,
[TABLE]
Proof of Theorem 1.2: We will begin by proving inductively that
[TABLE]
for all . For the base case, note that as all binary words of length beginning and ending with a zero have the five desired properties. For the induction step, we use Lemma 4.2 to conclude that
[TABLE]
for , which easily gives the required result. To complete the proof of the theorem, we simply note that since all the words in have the same length, their common length must be at least .
5 Determining $f(3,q)$ up to a constant factor
In this section, we prove Theorem 1.3, which determines the value of $f(3,q)$ up to an absolute constant. We begin by proving the upper bound. In the proof, we will say that an interval is constant if only one letter appears in that interval.
Theorem 5.1
For ,
Proof: Let be a word of length over a -letter alphabet which does not contain . Observe that there are no two intervals of length three in which are disjoint, non-consecutive and constant with respect to a given letter, as otherwise contains . For each letter in our alphabet, if there is a constant interval of length three in that letter, we delete this interval and one of the letters immediately before or after it so that no constant intervals of length three in that letter remain. This process deletes at most intervals of length at most four, leaving disjoint intervals of of total length at least with the property that each such interval has no constant word of length three.
If such an interval has two consecutive letters that are identical, replace it by a single instance of the same letter to obtain a new word on a reduced interval . The word has no two consecutive identical letters and . By the pigeonhole principle, each interval of length in contains a copy of , and hence a minimal copy of . Each minimal copy of in consists of an interval with , where, for , we have if and only if and . The length of such a minimal copy of is and it contains distinct letters. The number of intervals of length in is . Let be the total number of such intervals of length taken over all of the at most intervals . Then
[TABLE]
Note that each minimal copy of in comes from a minimal copy of in , with each internal letter either originally coming from or . Thus, each minimal copy of of length in some comes from one of possible minimal copies of in . Note also that in the intervals of , we cannot have three copies of the same minimal , as otherwise the first and last copy of would be disjoint and separated by at least one letter, giving rise to a copy of . By the pigeonhole principle, if we get the same minimal copy of in the reduced intervals more than times, then we get three identical minimal copies of in the original word , giving a copy of , a contradiction. Let be the number of minimal copies of of length we get in total across the reduced intervals. As there are possible minimal copies of of length with no two consecutive letters equal, we have . Also, each copy of of length is in at most intervals of length . Hence,
[TABLE]
where the first equality follows by letting . Comparing the upper and lower bounds for , we get . Hence, for .
With some additional work, one can improve the upper bound in this theorem by an asymptotic factor of . In the proof described above, we obtained the interval from by collapsing any instances of to . In the worst case, where every letter appears twice, this may cause our interval to shrink by a factor of . However, by being more careful, one can get a bound which reflects the fact that one typically needs to collapse adjacent letters only half the time. We suspect that the bound which results from applying this idea may be optimal.
Question 5.1
Prove or disprove that
[TABLE]
We next present a lower bound construction, drawing on ideas used in the construction of de Bruijn sequences (see, for example, [4] or [9]), which gives . This bound is off from the actual value by a factor , but, as we will see below, may be modified to recover this missing factor.
Say that a word over a -letter alphabet has property if any two instances of the same letter have distance at least and all intervals of length are distinct. It is easy to check that any word with property avoids the Zimin word . Indeed, if there is a copy of of minimal length, then the consists of a single letter. Then has to consist of at least letters as any two instances of are at distance at least . As we get twice, this implies that there are two identical intervals of length , contradicting property .
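The avoidance argument above can be checked by brute force on small words. The following is a minimal sketch, assuming the standard recursive definition of the Zimin words ($Z_1 = x_1$ and $Z_n = Z_{n-1} x_n Z_{n-1}$). The function names and the parameters `min_gap` and `window`, which stand for the two quantities appearing in the property, are illustrative choices and not notation from the paper.

```python
def is_zimin_instance(word, n):
    """Return True if `word` arises from Z_n by substituting nonempty words."""
    if n == 1:
        return len(word) >= 1
    # Z_n = Z_{n-1} x_n Z_{n-1}: word must be u + x + u with x nonempty
    # and u itself an instance of Z_{n-1}.
    for k in range(1, (len(word) - 1) // 2 + 1):
        if word[:k] == word[-k:] and is_zimin_instance(word[:k], n - 1):
            return True
    return False


def contains_zimin(word, n):
    """Return True if some contiguous subword of `word` is an instance of Z_n."""
    length = len(word)
    return any(
        is_zimin_instance(word[i:j], n)
        for i in range(length)
        for j in range(i + 1, length + 1)
    )


def has_property(word, min_gap, window):
    """Check that equal letters are at distance >= min_gap and that all
    intervals of length `window` are pairwise distinct (word is a str)."""
    last_seen = {}
    for pos, letter in enumerate(word):
        if letter in last_seen and pos - last_seen[letter] < min_gap:
            return False
        last_seen[letter] = pos
    windows = [word[i:i + window] for i in range(len(word) - window + 1)]
    return len(windows) == len(set(windows))
```

Following the argument in the text, a word for which `has_property(word, d, d + 1)` holds should never satisfy `contains_zimin(word, 3)`: a minimal copy would repeat a block of at least `d + 1` letters.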
We next prove that the length of the longest word with property is and hence . Indeed, it suffices to construct a word with property of length as any such word contains each of the possible intervals of length exactly once and by the pigeonhole principle it follows that this is the longest possible length of such a word.
Construct a directed graph on the words of length over a -letter alphabet that have distinct letters, where an edge is directed from vertex to vertex if the last letters of are the first letters of . Each vertex of this directed graph has indegree and outdegree . Thus, has edges. We claim that this directed graph is strongly connected, that is, it is possible to follow a directed path from any vertex to any other vertex.
Claim 5.1
The directed graph is strongly connected.
Proof: It suffices to show that there is a walk in from any vertex to any other vertex . By symmetry in the letters, we can assume is the word . For vertices which correspond to a permutation of , it will suffice to be able to get to any adjacent transposition of , since adjacent transpositions generate the group of permutations. Thus, we simply need to get from to , which is the same as except and have switched places. We can do this by considering the word formed by concatenating , then the single letter , and then . By considering successive intervals of length from this word of length , we find a walk of length from to in . We also need to show how to get from the word to another word which doesn’t have the same set of letters. Suppose, therefore, that has letter and does not have letter . Since we can get from any vertex to any permutation of its letters, we can assume is the same as but with replaced by . But, by concatenating and , we have a walk from to in of length , completing the proof.
As the directed graph has equal indegree and outdegree at each vertex and is strongly connected, it is Eulerian, that is, there exists an Eulerian tour covering all the edges. If we form a word by starting with the word of the first vertex and adding one letter at a time for each edge as we walk along the Eulerian tour, this gives a word of the desired length with property .
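The Eulerian tour construction can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the vertex length `m` is left as a parameter, Hierholzer's algorithm is our choice of method for finding the tour, and the helper names are hypothetical. The sketch relies on the graph being strongly connected, as in Claim 5.1, which requires `m` to be smaller than the alphabet size.

```python
from itertools import permutations


def build_graph(alphabet, m):
    """Vertices: words of length m with distinct letters.
    Edge u -> v when the last m-1 letters of u are the first m-1 letters of v."""
    graph = {}
    for perm in permutations(alphabet, m):
        u = "".join(perm)
        suffix = u[1:]
        # The new last letter must differ from the m-1 letters it follows.
        graph[u] = [suffix + c for c in alphabet if c not in suffix]
    return graph


def eulerian_word(alphabet, m):
    """Trace an Eulerian tour (Hierholzer's algorithm) and spell out the word."""
    graph = build_graph(alphabet, m)
    remaining = {u: list(vs) for u, vs in graph.items()}
    start = next(iter(graph))
    stack, tour = [start], []
    while stack:
        u = stack[-1]
        if remaining[u]:
            stack.append(remaining[u].pop())  # follow an unused edge
        else:
            tour.append(stack.pop())          # dead end: emit vertex
    tour.reverse()
    # Start with the first vertex, then append one letter per edge.
    return tour[0] + "".join(v[-1] for v in tour[1:])
```

For example, `eulerian_word("abcd", 3)` returns a word of length 51: the graph has 24 vertices of outdegree 2, hence 48 edges, and the starting vertex contributes 3 letters. Every interval of length 4 in that word corresponds to a distinct edge.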
We now improve this argument to give a bound which is within a constant factor of the upper bound.
Theorem 5.2
For , .
Proof: Consider the directed graph on vertices, where each vertex is formed from a word of length with distinct letters by replacing each internal letter with or . Notice that the vertices are words of length somewhere between and . We place an edge from vertex to vertex if the last distinct letters of is the same as the first distinct letters from (this is without repetition of letters) and the subword of starting at the third distinct letter of and ending at the second to last letter of is the same as the subword of consisting of its second letter to its third to last distinct letter (this is with repetition of letters). For example, if with alphabet , then the outneighbors of vertex are , , , and . Each vertex of the directed graph has indegree and outdegree , so the number of edges of is .
A slight modification of Claim 5.1 shows that this directed graph is strongly connected. Indeed, the only substantial difference is that we also need to be able to get from one vertex to the vertex which is the same word as except a single internal letter that appears by itself in is replaced by or vice versa. But, by concatenating and , we get a walk in our directed graph from to of length . As the directed graph is strongly connected and each vertex has equal indegree and outdegree, it is Eulerian, and there is an Eulerian tour starting with a longest vertex (which corresponds to using each of the internal letters twice) covering all of the edges. This Eulerian tour gives rise to a word of the desired length that avoids . Indeed, it avoids as otherwise we would have two identical copies of , each giving rise to the same edges of in the Eulerian tour, contradicting the fact that each edge is used exactly once. Furthermore, each vertex has two outgoing edges which add one letter to the end of the word and two outgoing edges which add two letters to the end of the word. This gives an average of letters per edge, after the initial vertex of letters, giving a total length of .
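The letters-per-edge count in the last step can be written out explicitly. Here $\ell_0$ (the length of the starting vertex) and $|E|$ (the number of edges of the graph) are shorthand introduced only for this sketch. Since the tour uses every edge exactly once and each vertex has two outgoing edges adding one letter and two adding two letters, exactly half of the edges add one letter and half add two.

```latex
\[
  \frac{2 \cdot 1 + 2 \cdot 2}{4} = \frac{3}{2},
  \qquad
  \text{total length} = \ell_0 + \tfrac{3}{2}\,|E| .
\]
```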
6 Concluding remarks
Explicit constructions for . Our first proof that is non-constructive, relying upon an application of the Lovász local lemma. However, our second proof, discussed in Section 5 and giving a bound which is tight to within a constant, can be made algorithmic, constructing the required -free word in time polynomial in its length. Indeed, this proof boils down to constructing an Eulerian tour in an Eulerian directed graph and it is well known that this can be done efficiently.
Another, stronger notion of explicitness asks that each letter of the word can be computed in time polynomial in . We describe below another construction of a word of length which is -free and explicit in this sense. This construction is similar to the random construction used in the proof of Theorem 3.1, except that the permutation of used on the interval is now defined explicitly instead of randomly.
Split the alphabet arbitrarily into parts , each of size . Let denote the first primes and for . Writing , we have and each pair with is relatively prime. We construct a word of length , consisting of intervals of length . For , delete the lexicographic last permutations of , keeping the remaining permutations of . With period , we use these permutations in lexicographic order to fill the intervals . For this word to contain a copy of , it must contain two identical subwords, each consisting of consecutive intervals of length . But then the difference in their indices must be times a multiple of for values of . Since the are relatively prime, the difference of the indices must therefore be a multiple of . However, as the number of intervals is at most , there cannot be two such intervals, and we are done.
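The number-theoretic step in this argument, that indices agreeing modulo several pairwise coprime periods must differ by a multiple of their product, can be illustrated with a small sketch. The helper names are hypothetical, and the periods used here are simply the first few primes, chosen for illustration rather than taken from the construction itself.

```python
from math import gcd, prod


def first_primes(count):
    """Return the first `count` primes by trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def first_repeat(periods):
    """Smallest positive shift t such that indices i and i + t
    agree modulo every period."""
    t = 1
    while any(t % p for p in periods):
        t += 1
    return t


def pairwise_coprime(values):
    """Check that every pair of values has greatest common divisor 1."""
    return all(
        gcd(a, b) == 1
        for i, a in enumerate(values)
        for b in values[i + 1:]
    )
```

For instance, `first_repeat(first_primes(4))` returns 210, which equals `prod([2, 3, 5, 7])`. A sequence of fewer than 210 intervals driven by these four periods therefore cannot contain two positions that agree in every coordinate.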
Random words. For fixed and tending to infinity, it is possible to show that the threshold length for the appearance of in a random word over an alphabet of size is . For example, over the English alphabet, with , we will likely find a copy of in a random word of length but the minimum word length needed to guarantee a copy is about . The proof, which we sketch below, is similar to the birthday paradox. This is easiest to see when , as we are simply looking for a word with repeated letters (with the slight caveat that we don’t want these letters to be adjacent).
To prove the upper bound, we estimate the number of copies of where each word is a single letter. The length of each such copy of is and, as there are variables , we see that the probability a random word of length is a copy of is . Therefore, if we take a random word of length , we expect roughly such copies of . Furthermore, the number of copies will be concentrated around this value and almost all of them will be disjoint and separated by at least one letter. We have possible copies of of this type (for comparison to the birthday paradox, think of as the number of days in a year) and once we get about of these short copies of , we will likely get two that are the same, giving a copy of . So we want with and, hence, .
The lower bound is a union bound over all possible (most of them are very unlikely, as the are so long that getting repeats of the same long word is incredibly unlikely). In fact, there is even a hitting time result (think of building a word one letter at a time here, adding letters at the end) saying that almost surely appears at the same time when you first find two identical copies of , each of length .
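The birthday-paradox heuristic is easiest to see in simulation for the simplest case mentioned above, where one only looks for a letter that reappears with at least one other letter in between. The sketch below is an illustration with hypothetical function names and a fixed seed; it does not reproduce any computation from the paper.

```python
import random


def first_nonadjacent_repeat(q, rng):
    """Length of a random word over q letters at the moment some letter
    first reappears with at least one letter in between."""
    seen_before_prev = set()  # letters at positions <= current - 2
    prev = None
    length = 0
    while True:
        letter = rng.randrange(q)
        length += 1
        if letter in seen_before_prev:
            return length
        if prev is not None:
            seen_before_prev.add(prev)
        prev = letter


def mean_hitting_length(q, trials, seed=0):
    """Average hitting length over independent random words."""
    rng = random.Random(seed)
    total = sum(first_nonadjacent_repeat(q, rng) for _ in range(trials))
    return total / trials
```

As in the classical birthday problem, one would expect `mean_hitting_length(q, trials)` to grow like a constant times the square root of `q`, far below the word length needed to guarantee a repeat.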
-unavoidability. Recall that a pattern is -unavoidable if every sufficiently long word over a -letter alphabet contains a copy of . Though the results of Zimin and Bean, Ehrenfeucht and McNulty completely determine those patterns which are -unavoidable for all , much less is known about the patterns which are -unavoidable for some . In particular, given , one may ask whether there is a pattern which is -unavoidable but -avoidable. Words with this property are known for and , but it is an open problem to construct such words for . To give some indication of the difficulty, we note that the pattern constructed by Clark [6] which is 4-unavoidable but 5-avoidable is , which admits no obvious generalisation. In light of such difficulties, we believe that any further progress on understanding those patterns which are -unavoidable for some but not all would be interesting.
Note added in proof. After this paper was completed, we learned that a variant of our Theorem 1.2 was obtained simultaneously and independently by Carayol and Göller [5]. It is also worth noting that the results of Section 3 give an affirmative answer to a question raised in their paper, namely, whether the probabilistic method can be used to give a tower-type lower bound for the function .
References
- [1] J.-P. Allouche and J. Shallit, The ubiquitous Prouhet–Thue–Morse sequence, in Sequences and their applications: Proceedings of SETA ’98, 1–16, Springer, London, 1999.
- [2] N. Alon and J. H. Spencer, The probabilistic method, 4th edition, Wiley, 2015.
- [3] D. A. Bean, A. Ehrenfeucht and G. F. McNulty, Avoidable patterns in strings of symbols, Pacific J. Math. 85 (1979), 261–294.
- [4] J. Berstel and D. Perrin, The origins of combinatorics on words, European J. Combin. 28 (2007), 996–1022.
- [5] A. Carayol and S. Göller, On long words avoiding Zimin patterns, in 34th Symposium on Theoretical Aspects of Computer Science (STACS 2017), Article No. 19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 2017.
- [6] R. J. Clark, The existence of a pattern which is 5-avoidable but 4-unavoidable, Internat. J. Algebra Comput. 16 (2006), 351–367.
- [7] D. Conlon, J. Fox and B. Sudakov, Recent developments in graph Ramsey theory, in Surveys in Combinatorics 2015, London Math. Soc. Lecture Note Ser., Vol. 424, 49–118, Cambridge University Press, Cambridge, 2015.
- [8] J. Cooper and D. Rorabaugh, Bounds on Zimin word avoidance, Congr. Numer. 222 (2014), 87–95.
