k-Spectra of Weakly-c-Balanced Words
Joel D. Day (1), Pamela Fleischmann (2), Florin Manea (2), Dirk Nowotka (2)
(1) Loughborough University, UK, [email protected]
(2) Kiel University, Germany, {fpa,flm,dn}@informatik.uni-kiel.de
Abstract
A word u is a scattered factor of w if u can be obtained from w by deleting some of its
letters. That is, there exist the (potentially empty) words u1,u2,...,un, and
v0,v1,..,vn such that u=u1u2...un and w=v0u1v1u2v2...unvn. We consider the
set of length-k scattered factors of a given word w, called here k-spectrum and denoted
ScatFactk(w). We prove a series of properties of the sets ScatFactk(w) for binary
weakly-0-balanced and, respectively, weakly-c-balanced words w, i.e., words over a two-letter alphabet where the number of occurrences of each letter is
the same, or, respectively, one letter has c more occurrences than the other.
In particular, we consider the question of which cardinalities $n = |\mathrm{ScatFact}_k(w)|$ are obtainable, for a positive integer k, when w is either a weakly-0-balanced binary word of
length 2k, or a weakly-c-balanced binary word of length 2k−c. We also consider the problem of
reconstructing words from their k-spectra.
1 Introduction
Given a word w, a scattered factor (also called scattered subword, or simply subword in the
literature) is a word obtained by removing one or more factors from w. More formally, u is a
scattered factor of w if there exist $u_1,\dots,u_n\in\Sigma^*$, $v_0,\dots,v_n\in\Sigma^*$ such that $u=u_1u_2\dots u_n$ and $w=v_0u_1v_1u_2\dots u_nv_n$.
Consequently a scattered factor of w can be thought of as a representation of w in which some parts
are missing. As such, there is considerable interest in the relationship of a word and its
scattered
factors from both a theoretical and practical point of view. For an introduction to the study of scattered factors,
see Chapter 6 of [9].
On the one hand, it is easy to imagine how, in any situation where discrete, linear data is read
from an imperfect input – such as when sequencing DNA or during the transmission of a digital
signal – scattered factors form a natural model, as multiple parts of the input may be missed, but
the rest will remain unaffected and in-sequence. For instance, various applications and connections of this model in verification are discussed in [14, 6] within a language theoretic framework, while applications of the model in DNA sequencing are discussed in [4] in an algorithmic framework.
On the other hand, from a more algebraic
perspective, there have been efforts to bridge the gap between the non-commutative field of
combinatorics on words and traditional commutative mathematics via Parikh matrices (cf.
e.g., [11, 13]), which are closely related to, and influenced by, the topic of
scattered factors.
The set (or also in some cases, multi-set) of scattered factors of a word w, denoted
ScatFact(w) is typically exponentially large in the length of w, and contains a lot of
redundant information in the sense that, for k′<k≤∣w∣, a word of length k′ is a
scattered factor of w if and only if it is a scattered factor of a scattered factor of w of
length k. This has led to the idea of k-spectra: the set of all length-k scattered factors of
a word. For example, the 3-spectrum of the word ababbb is the set
{aab,aba,abb,bab,bbb}.
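The example above can be reproduced mechanically. The following is a minimal sketch, assuming a brute-force approach that is only practical for short words; the function name `k_spectrum` is ours and not taken from the literature. It enumerates all ways of keeping k positions of w and collects the resulting words in a set, so multiplicities are ignored, as in the definition used in this paper.

```python
from itertools import combinations

def k_spectrum(w, k):
    """Return the set of distinct length-k scattered factors of w.

    combinations() keeps k letters of w in their original order, which is
    the same as deleting the remaining len(w) - k letters.
    """
    return {"".join(letters) for letters in combinations(w, k)}

print(sorted(k_spectrum("ababbb", 3)))
# ['aab', 'aba', 'abb', 'bab', 'bbb']
```

The running time is proportional to the binomial coefficient of |w| over k, so this is meant for experiments, not as an efficient algorithm.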
Note that unlike some literature, we do not consider the k-spectra to be the multi-set of
scattered factors in the present work, but rather ignore the multiplicities. This distinction is
non-trivial as there are significant variations on the properties based on these different
definitions (cf. e.g., [10]).
Also, the notion of k-spectra is closely related to the classical notion of
factor complexity of words, which counts, for each positive integer k, the
number of distinct factors of length k of a word. Here, the cardinality of the
k-spectrum of a word gives the number of the word’s distinct scattered
factors of length k.
One of the most fundamental questions about k-spectra of words, and indeed sets of scattered
factors in general, is that of recognition: given a set S of words (of length
k), is S the subset
of a k-spectrum of some word? In general, it remains a long standing
goal of the theory to give a “nice” descriptive characterisation of scattered factor sets (and
k-spectra), and to better understand their structure [9].
Another fundamental question concerning k-spectra, and one well motivated in
several
applications,
is the question of reconstruction: given a word w of length n, what is the smallest value k such that w is uniquely determined by its
k-spectrum?
This question was addressed and solved successively in a variety of cases. In particular, in [3], the exact bound of
$\frac{n}{2}+1$ is given in the general case. Other variations, including for the definition of
k-spectra where multiplicities are also taken into account, are considered
in [10], while [7] considers the question of reconstructing words
from their palindromic scattered factors.
In the current work, we consider k-spectra in the restricted setting of
a binary alphabet $\Sigma=\{a,b\}$. For such an alphabet, we can always identify the natural number $c\in\mathbb{N}_0$
which describes how weakly balanced a word is: c is the difference between
the numbers of as and bs. Thus, it seems natural to categorise all words
over Σ according to this difference: a binary word where one letter has
exactly c more occurrences than the other one is called weakly-c-balanced. In
Section 3 the cardinalities of k-spectra of weakly-c-balanced words of length
2k−c are investigated. Our first results concern the minimal and maximal cardinality that $\mathrm{ScatFact}_k$ might have.
We show that, for weakly-0-balanced words, the cardinality ranges between k+1 and $2^k$, and determine exactly for which words of length 2k these values are reached.
In the case of weakly-c-balanced words, we are able to replicate the result regarding the minimal cardinality of $\mathrm{ScatFact}_k$, but the case of maximal cardinality seems to be more complicated.
It seems that words containing many alternations between the two letters of the alphabet have larger sets $\mathrm{ScatFact}_k$. Therefore, we first investigate the scattered factors of the words which are prefixes of
$(ab)^\omega$ and give a precise description of all scattered factors of any length of such words. That is, not only do we compute the cardinality of $\mathrm{ScatFact}_k(w)$ for all such words w,
but we also describe a way to obtain the respective scattered factors directly, without repetitions. We use this to describe exactly the sets $\mathrm{ScatFact}_i$ for the word $(ab)^{k-c}a^c$, which seems a good candidate for a weakly-c-balanced word with many distinct scattered factors.
Further, in Section 4, we explore in more depth the cardinalities of
$\mathrm{ScatFact}_k(w)$ for weakly-0-balanced words w of length 2k. We obtain
for these words that the smallest three numbers which are possible
cardinalities for their k-spectra are k+1,
2k, and 3k−3, thus identifying two gaps in the set of such cardinalities.
Among other results on this topic, we show that for every constant i there
exists a word w of length 2k such that $|\mathrm{ScatFact}_k(w)|\in\Theta(n^i)$; we
also show how such a word can be constructed.
Finally, in Section 5, we also approach the question of reconstructing
weakly-0-balanced words from k-spectra in the specific case that the spectra are also limited to
weakly-0-balanced
words only. While we are not able to resolve the question completely, we conjecture that the
situation is similar to the general case: the smallest value k such that w is uniquely determined by its
k-spectrum is $k=\frac{|w|}{2}+1$ if
$\frac{|w|}{2}$
is odd and $k=\frac{|w|}{2}+2$ otherwise, in the case when w contains at most two blocks of bs.
After introducing a series of basic definitions, preliminaries, and
notations, the organisation of the paper follows the description above. The proofs can be found in [2].
2 Preliminaries
Let $\mathbb{N}$ be the set of natural numbers, $\mathbb{N}_0=\mathbb{N}\cup\{0\}$, and let
$\mathbb{N}_{\geq k}$ be the set of all natural numbers greater than or equal to k.
Let [n] denote the set {1,…,n} and $[n]_0=[n]\cup\{0\}$
for an $n\in\mathbb{N}$.
We consider words w over the alphabet $\Sigma=\{a,b\}$.
$\Sigma^*$ denotes the set of all finite words over Σ, also called binary words, and
$\Sigma^\omega$ the set of all infinite words over Σ, also called binary infinite words.
The empty word is denoted by ε and
$\Sigma^+$ is the free semigroup $\Sigma^*\setminus\{\varepsilon\}$.
The length of a word w is denoted by |w|. Let
$\Sigma^{\leq k}:=\{w\in\Sigma^*\mid |w|\leq k\}$ and let $\Sigma^k$ be the
set of all words of length exactly $k\in\mathbb{N}$. The number of occurrences of a
letter $a\in\Sigma$ in a word $w\in\Sigma^*$ is denoted by $|w|_a$.
The ith letter of a word w is given by w[i] for $i\in[|w|]$. For a given
word $w\in\Sigma^n$ the reversal of w is defined by
$w^R=w[n]w[n-1]\dots w[2]w[1]$. The powers of $w\in\Sigma^*$ are
defined recursively by $w^0=\varepsilon$, $w^n=ww^{n-1}$ for $n\in\mathbb{N}$.
A word $w\in\Sigma^*$ is called weakly-c-balanced if $\bigl||w|_a-|w|_b\bigr|=c$
for $c\in\mathbb{N}_0$.
Thus weakly-0-balanced words have the same number of as and bs. Let
$\Sigma^*_{wzb}$ be the set of all weakly-0-balanced words over Σ.
For example, abaa is weakly-2-balanced, aba is weakly-1-balanced,
while abbaba is weakly-0-balanced.
A word $u\in\Sigma^*$ is a factor of $w\in\Sigma^*$ if
w=xuy holds for some words $x,y\in\Sigma^*$. Moreover, u is a
prefix of w if x=ε holds and a suffix if
y=ε holds. The factor of w from the ith to the jth
letter will be denoted by w[i..j] for 1≤i≤j≤|w|.
Given a letter $a\in\Sigma$ and a word $w\in\Sigma^*$, a block of
a is a maximal run of as in w, i.e., a factor u=w[i..j] with $u=a^{j-i+1}$, such that either i=1 or
$w[i-1]=b\neq a$, and either j=|w| or $w[j+1]=b\neq a$. For example the
word abaaabaabb has 3 a-blocks and 3 b-blocks.
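As a small illustration of the block notion, the following sketch splits a word into its maximal runs of equal letters using the standard `itertools.groupby`. The helper names `blocks` and `count_blocks` are hypothetical and used only for this example.

```python
from itertools import groupby

def blocks(w):
    """Split w into maximal runs of equal letters, as (letter, length) pairs."""
    return [(letter, len(list(run))) for letter, run in groupby(w)]

def count_blocks(w, letter):
    """Number of blocks of the given letter in w."""
    return sum(1 for x, _ in blocks(w) if x == letter)

print(blocks("abaaabaabb"))
# [('a', 1), ('b', 1), ('a', 3), ('b', 1), ('a', 2), ('b', 2)]
print(count_blocks("abaaabaabb", "a"), count_blocks("abaaabaabb", "b"))
# 3 3
```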
Scattered factors and k-spectra are defined as follows.
Definition 1
A word $u=a_1\dots a_n\in\Sigma^n$, for $n\in\mathbb{N}$, is a scattered factor of
a word
$w\in\Sigma^+$ if there exist $v_0,\dots,v_n\in\Sigma^*$ with
$w=v_0a_1v_1\dots v_{n-1}a_nv_n$. Let ScatFact(w) denote the set of w’s scattered factors and
consider
additionally $\mathrm{ScatFact}_k(w)$ and $\mathrm{ScatFact}_{\leq k}(w)$ as the two subsets of
ScatFact(w)
which contain only the scattered factors of length $k\in\mathbb{N}$ or the ones up to
length k, respectively.
The sets $\mathrm{ScatFact}_{\leq k}(w)$ and $\mathrm{ScatFact}_k(w)$ are also known as the full
k-spectrum and, respectively, the
k-spectrum of a word $w\in\Sigma^*$ (see [1],
[10],
[12]); moreover, scattered factors are often called subwords
or scattered subwords. Obviously the k-spectrum is empty for k>|w|,
contains
exactly w’s letters for k=1, and contains only w for k=|w|. Considering the word
w=abba,
the other spectra are given by $\mathrm{ScatFact}_2(w)=\{a^2,b^2,ab,ba\}$
and
$\mathrm{ScatFact}_3(w)=\{ab^2,aba,b^2a\}$.
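Membership in ScatFact(w) can be tested without enumerating the whole set. The sketch below uses the usual greedy left-to-right matching (each letter of u is matched to its earliest possible occurrence in w); the function name `is_scattered_factor` is an assumption of this example. It then recovers the spectra of abba given above by filtering all binary words of the respective length.

```python
from itertools import product

def is_scattered_factor(u, w):
    """Greedy test: scan w once, consuming it while matching the letters of u."""
    remaining = iter(w)
    # 'letter in remaining' advances the iterator up to the first match.
    return all(letter in remaining for letter in u)

def spectrum_by_filtering(w, k):
    """k-spectrum of a binary word w, obtained by testing every word of length k."""
    candidates = ("".join(t) for t in product("ab", repeat=k))
    return {u for u in candidates if is_scattered_factor(u, w)}

print(sorted(spectrum_by_filtering("abba", 2)))  # ['aa', 'ab', 'ba', 'bb']
print(sorted(spectrum_by_filtering("abba", 3)))  # ['aba', 'abb', 'bba']
```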
It is worth noting that if u is a scattered factor of w, and v is a
scattered factor of u, then v is a scattered factor of w.
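Additionally, notice two important symmetries regarding k-spectra.
For $w\in\Sigma^*$ and the renaming morphism
$\overline{\cdot}:\Sigma\to\Sigma$ with $\overline{a}=b$ and
$\overline{b}=a$ we have $\mathrm{ScatFact}(w^R)=\{u^R\mid u\in\mathrm{ScatFact}(w)\}$ and $\mathrm{ScatFact}(\overline{w})=\{\overline{u}\mid u\in\mathrm{ScatFact}(w)\}$.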
Thus, from a structural point of
view, it is sufficient
to consider only one
representative from the equivalence classes induced by the equivalence relation
where w1 is equivalent to w2 whenever
w2 is obtained by a composition of reversals and renamings from w1.
Considering w.l.o.g. the order a<b on Σ, we
choose the
lexicographically smallest word as representative from each class. As such, we will mostly analyse the k-spectra of
words starting with a. We shall
make
use of this fact
extensively in Section 4.
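The choice of representatives can be sketched as follows. This is an illustration under the assumption that words are given as Python strings over the letters a and b; the names `rename`, `representative`, and `k_spectrum` are ours. Python's string comparison is lexicographic with a<b, which matches the order fixed above.

```python
from itertools import combinations

def k_spectrum(w, k):
    """Set of distinct length-k scattered factors of w (brute force)."""
    return {"".join(t) for t in combinations(w, k)}

def rename(w):
    """Apply the renaming morphism: swap the letters a and b."""
    return w.translate(str.maketrans("ab", "ba"))

def representative(w):
    """Lexicographically smallest word obtainable from w by reversal and renaming."""
    return min(w, w[::-1], rename(w), rename(w)[::-1])

print(representative("bba"))     # aab
print(representative("bbabaa"))  # aababb

# The two symmetries of the spectra, checked on one word:
w, k = "abbab", 3
assert k_spectrum(w[::-1], k) == {u[::-1] for u in k_spectrum(w, k)}
assert k_spectrum(rename(w), k) == {rename(u) for u in k_spectrum(w, k)}
```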
3 Cardinalities of k-Spectra of Weakly-c-Balanced Words
In the current section, we consider the combinatorial properties of k-spectra of
weakly-c-balanced finite words. In particular, we are interested in the cardinalities of the
k-spectra and in the question: which cardinalities are (not) possible? Since
the k-spectra of $a^n$ and $b^n$ are just $\{a^k\}$ and $\{b^k\}$,
respectively, for all $n\in\mathbb{N}_0$ and $k\in[n]_0$, we assume
$|w|_a,|w|_b>0$ for $w\in\Sigma^*$.
It is
a straightforward observation that not every subset of Σk is a
k-spectrum of some word w. For example, for k=2, aa and bb can
only be scattered factors of a word containing both as and bs, and
therefore having either ab or ba as a scattered factor as well. Thus, there is no word w such that ScatFact2(w)={aa,bb}.
In general, for any word containing only as or only bs, there will be exactly one scattered factor of each length, while for words containing both as and bs, the smallest k-spectra are realised for words of the form $w=a^nb$ (up to renaming and reversal), for which $\mathrm{ScatFact}_k(w)=\{a^k,a^{k-1}b\}$ for each $k\in[n]$. On the other hand, as Proposition 5 shows, the maximal k-spectra are those containing all words of length k, and hence have size $2^k$, achieved by e.g. $w=(ab)^n$ for n≥k. Note that when weakly-0-balanced words are considered, the same maximum applies, since $(ab)^n$ is weakly-0-balanced, while the minimum does not, since $a^nb$ is not weakly-0-balanced.
It is straightforward to enumerate all possible k-spectra, and describe the words realising them for k≤2, hence we shall generally consider only k-spectra in the sequel for which k≥3.
Our first result generalises the previous
observation about minimal-size k-spectra.
Theorem 3.2
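For $k\in\mathbb{N}_{\geq 3}$, $c\in[k-1]_0$, $i\in[c]_0$, and a weakly-c-balanced word
$w\in\Sigma^{2k-c}$, we have
$|\mathrm{ScatFact}_{k-i}(w)|\geq k-c+1$, where equality holds if and only if
$w\in\{a^kb^{k-c},a^{k-c}b^k,b^ka^{k-c},b^{k-c}a^k\}$.
Moreover, if $w\in\Sigma^{2k}_{wzb}\setminus\{a^kb^k,b^ka^k\}$, then $|\mathrm{ScatFact}_k(w)|\geq k+3$.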
Proof
Consider firstly only weakly-0-balanced words, i.e. c=0, and w.l.o.g. only
$w=a^kb^k$. The cases k=1 and k=2 are the induction basis.
The word $a^kb^k$ obviously has exactly the words $a^rb^s$ with $r,s\in[k]_0$ and r+s=k as
scattered factors of length k, thus
k+1 many. This proves the ⇐-direction.
Consider now a word $w\in\Sigma^{2k}_{wzb}\setminus\{a^kb^k,b^ka^k\}$.
Since
w is neither $a^kb^k$ nor $b^ka^k$, w contains a factor aba or
bab. Assume w.l.o.g. that w=xabay holds for
$x,y\in\Sigma^*$ with
|x|+|y|=2k−3. From $w\in\Sigma^{2k}_{wzb}$ it follows that $|x|_b$ or
$|y|_b$ is not zero. Choose
w.l.o.g. $z_1,z_2\in\Sigma^*$ with $y=z_1bz_2$, which implies
$w=xabaz_1bz_2$. Consequently
$|xz_1z_2|_a=|xz_1z_2|_b=k-2$ holds.
case 1: $xz_1z_2=a^{k-2}b^{k-2}$
By induction $|\mathrm{ScatFact}_{k-2}(xz_1z_2)|=(k-2)+1=k-1$. Let u be a scattered
factor of
$xz_1z_2$ of length k−2. Then there
exist $u_1$, $u_2$, and $u_3$ with $u=u_1u_2u_3$ such that $u_1$ is a scattered factor of x, $u_2$
of $z_1$, and
$u_3$ of $z_2$, respectively. Consequently
[TABLE]
are different elements of ScatFactk(w).
Each scattered factor of xz1z2 is of the form arbs for
r,s∈[k−2]0. We will now prove in which cases the aforementioned scattered
factors are different. Consider u=u1u2u3=arbs and
u′=u1′u2′u3′=ar′bs′ to be different scattered factors of this
form, i.e. r=r′ and s=s′. Set
[TABLE]
If u1=ar1, u2u3=ar2bs and u1′=ar1′,
u2′u3′=ar2′bs′ with r1+r2=r and r1′+r2′=r′, we get
because of r=r′, r1=−1,
[TABLE]
If u1=ar1, u2u3=ar2bs and u1′=ar′bs1′,
u2′u3′=bs2′ with r1+r2=r, s1′+s2′=s′, and s1′=0
(already in the previous case) we get
because of s1′=0,
[TABLE]
If u1=arbs1, u2u3=bs2 and u1′=ar′bs1′,
u2′u3′=bs2′ with r1+r2=r, s1′+s2′=s′, and s1,s1′=0
(already in the previous case) we get
because of r′=r and s1,s1′=0,
[TABLE]
Consequently α1 and α2 are all different and we get 2(k−1)
many different scattered factors. Assume now additionally ∣r−r′∣=3.
If u1=ar1, u2u3=ar2bs and u1′=ar1′,
u2′u3′=ar2′bs′ with r1+r2=r and r1′+r2′=r′, we get
because of s1′=0, r′=r, r′=r+1
[TABLE]
If u1=ar1, u2u3=ar2bs and u1′=ar′bs1′,
u2′u3′=bs2′ with r1+r2=r, s1′+s2′=s′, and s1′=0
(already in the previous case) we get
because of s1′=0, r′=r+2,
[TABLE]
If u1=arbs1, u2u3=bs2 and u1′=ar′bs1′,
u2′u3′=bs2′ with r1+r2=r, s1′+s2′=s′, and s1,s1′=0
(already in the previous case) we get
because of r′=r and s1,s1′=0, r′=r+2,
[TABLE]
Consequently we have another $\lfloor\frac{k-2}{3}\rfloor+1$ different
scattered factors. This sums up to $|\mathrm{ScatFact}_k(w)|\geq\frac{7k-8}{3}>k+1$. An immediate result is that the k-spectrum has
at least k+3 elements for k≥5. For k=3 and k=4 the results
can be easily verified by testing.
case 2: $xz_1z_2\neq a^{k-2}b^{k-2}$
In this case all words of the form $a^r\,aba\,a^s$ for r+s=k−3,
$r\in[|x|_a]_0$, and
$s\in[|y|_a]_0$ are $|x|_a+1$ different scattered factors of length k
of w.
Analogously all
$b^{r'}\,aba\,b^{s'}$ with r′+s′=k−3, $r'\in[|x|_b]_0$,
$s'\in[|y|_b]_0$ are $|x|_b+1$
different scattered factors of length k of w. All these factors are
different and additionally
w has $a^k$ and $b^k$ as scattered factors. Hence $|\mathrm{ScatFact}_k(w)|\geq|x|_a+|x|_b+4=|x|+4$ holds. Since the
length of w is 2k, the length of xy is 2k−3 and consequently x and y
have different
lengths. Assume w.l.o.g. |x|>|y|, i.e. |x|≥k−1. This implies
$|\mathrm{ScatFact}_k(w)|\geq k+3$, which proves the claim for c=0.
Assume now c>0 and let $w=a^kb^{k-c}$. By the previous part we know
$|\mathrm{ScatFact}_{k-c}(w)|=k-c+1$ if and only if $w=a^{k-c}b^{k-c}$. The claim
about the
(k−c)-spectrum follows immediately by
$\mathrm{ScatFact}_{k-c}(w)=\mathrm{ScatFact}_{k-c}(a^kb^{k-c})$ since the prepended as
do not change the (k−c)-spectrum. For $i\in[c-1]_0$ notice
that $x\in\mathrm{ScatFact}_{k-i}(a^kb^{k-c})$ implies that ax (resp.
xb, xa, bx) is a scattered factor of $a^kb^{k-c}$ of length
k−i+1. Thus
$|\mathrm{ScatFact}_{k-i+1}(w)|\geq k-c+1$ follows.
On the other hand a scattered factor of $a^kb^{k-c}$ of length k−i+1
is exactly of this form, since it can neither start with b
($a^kb^{k-c}$ has only
(k−c) occurrences of b) nor
contain ba resp. ab (this would be the implication of a scattered
factor being of the
form ax′ with |x′|=k−i, $x'\in\mathrm{ScatFact}_{k-i}(a^kb^{k-c})$). ∎
Remark 3
Theorem 3.2
immediately answers in the negative the question whether a given set
$S\subseteq\Sigma^k$ with |S|<k+1 or |S|=k+2 is a k-spectrum of a word $w\in\Sigma^{2k}_{wzb}$.
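For small k the statements of Theorem 3.2 and Remark 3 can be checked exhaustively. The following sketch is only a sanity check by brute force, not part of the proof; the helper names `balanced_words` and `spectrum_sizes` are hypothetical. It enumerates all words with k occurrences of each letter and records the cardinality of their k-spectra.

```python
from itertools import combinations

def k_spectrum(w, k):
    """Set of distinct length-k scattered factors of w (brute force)."""
    return {"".join(t) for t in combinations(w, k)}

def balanced_words(k):
    """All binary words of length 2k with exactly k occurrences of a."""
    for a_positions in combinations(range(2 * k), k):
        chosen = set(a_positions)
        yield "".join("a" if p in chosen else "b" for p in range(2 * k))

def spectrum_sizes(k):
    """Map every weakly-0-balanced word of length 2k to the size of its k-spectrum."""
    return {w: len(k_spectrum(w, k)) for w in balanced_words(k)}

for k in (3, 4):
    sizes = spectrum_sizes(k)
    smallest = min(sizes.values())
    minimal_words = sorted(w for w, s in sizes.items() if s == smallest)
    # Smallest cardinality, the words attaining it, and all occurring cardinalities.
    print(k, smallest, minimal_words, sorted(set(sizes.values())))
```

For c>0 the same helper can be applied to $a^kb^{k-c}$; for instance, with k=4 and c=2 the spectra of aaaabb of lengths 4, 3, and 2 each have k−c+1=3 elements.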
Theorem 3.2 shows that the smallest cardinality of the k-spectrum of a word w
is reached when the letters in w are nicely ordered, both for
weakly-0-balanced words as well as for weakly-c-balanced words with c>0. The
largest cardinality is, not surprisingly, reached for words where the alternation of a and b letters is, in a sense, maximal, e.g., for $w=(ab)^k$. To this end, one can show a general result.
Theorem 3.4
For $w\in\Sigma^*$, the k-spectrum of w is $\Sigma^k$ if and only
if
$$\{ab,ba\}^k\cap\mathrm{ScatFact}_{2k}(w)\neq\emptyset.$$
The previous theorem has an immediate consequence, which exactly characterises the weakly-0-balanced words of length 2k for which the maximal cardinality of $\mathrm{ScatFact}_k(w)$ is reached.
Proof
We will show this result by induction.
For k=1, the equivalence is:
$$\mathrm{ScatFact}_1(w)=\Sigma\iff\{ab,ba\}\cap\mathrm{ScatFact}_2(w)\neq\emptyset.$$
If both a and b are scattered factors of w, ab or ba has
to be a factor and thus a scattered factor of w. On the other hand if w has
ab or ba as a scattered factor, it has a and b as scattered
factors.
Assume now that the equivalence holds for an arbitrary but fixed k−1∈N. We will show it holds for k.
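For the ⇐-direction consider
$u\in\{ab,ba\}^k\cap\mathrm{ScatFact}_{2k}(w)$. Thus,
$u\in\{ab,ba\}^{k-1}\{ab,ba\}$ and hence there exists
$u'\in\{ab,ba\}^{k-1}$ with $u\in u'\{ab,ba\}$. By induction
we have
$\mathrm{ScatFact}_{k-1}(u')=\Sigma^{k-1}$. For any $x\in\Sigma^k$ there exists
$x'\in\Sigma^{k-1}$
with $x\in x'\{a,b\}$. Since $x'\in\mathrm{ScatFact}_{k-1}(u')$, there exist
$a_0,\dots,a_{k-1}\in\Sigma^*$ with $u'=a_0x'[1]a_1\dots x'[k-1]a_{k-1}$.
By
$$u\in a_0x'[1]a_1\dots x'[k-1]a_{k-1}\{ab,ba\}$$
it follows in both cases, namely x=x′a or x=x′b, that
$x\in\mathrm{ScatFact}_k(u)\subseteq\mathrm{ScatFact}_k(w)$.
This proves the inclusion $\Sigma^k\subseteq\mathrm{ScatFact}_k(w)$.
By $\mathrm{ScatFact}_k(w)\subseteq\Sigma^k$ the first direction is proven.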
For the ⇒-direction assume $\mathrm{ScatFact}_k(w)=\Sigma^k$. Assume w.l.o.g. w[|w|]=a. Choose $x,y\in\Sigma^*$
with w=xy, x[|x|]=b, and $y\in a^*$. As $\Sigma^{k-1}b\subset\Sigma^k$, it follows that $\Sigma^{k-1}b\subseteq\mathrm{ScatFact}_k(x)$. Clearly, this means that
$\Sigma^{k-1}\subseteq\mathrm{ScatFact}_{k-1}(x[1..|x|-1])$. By the induction hypothesis, we get that $\{ab,ba\}^{k-1}\cap\mathrm{ScatFact}_{2(k-1)}(x[1..|x|-1])\neq\emptyset$. Thus,
$\{ab,ba\}^{k-1}x[|x|]a\cap\mathrm{ScatFact}_{2k}(w[1..|x|+1])\neq\emptyset$, because $w[1..|x|+1]=x[1..|x|]a$. Hence, $\{ab,ba\}^{k-1}ba\cap\mathrm{ScatFact}_{2k}(w)\neq\emptyset$. The conclusion follows.
∎
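The equivalence of Theorem 3.4 can also be tested exhaustively on short words. The sketch below is an experiment under the assumption that brute-force enumeration is acceptable (words of length at most 7, k at most 3); the function names are ours. It compares the property "the k-spectrum is all of $\Sigma^k$" with the existence of a scattered factor of length 2k from $\{ab,ba\}^k$.

```python
from itertools import combinations, product

def k_spectrum(w, k):
    """Set of distinct length-k scattered factors of w (brute force)."""
    return {"".join(t) for t in combinations(w, k)}

def has_full_spectrum(w, k):
    """True if every binary word of length k is a scattered factor of w."""
    return len(k_spectrum(w, k)) == 2 ** k

def has_block_witness(w, k):
    """True if some word of {ab,ba}^k is a scattered factor of w."""
    witnesses = {"".join(bl) for bl in product(("ab", "ba"), repeat=k)}
    return bool(witnesses & k_spectrum(w, 2 * k))

def check_equivalence(max_length, max_k):
    """Compare both properties on all binary words up to max_length."""
    for n in range(1, max_length + 1):
        for letters in product("ab", repeat=n):
            w = "".join(letters)
            for k in range(1, max_k + 1):
                if has_full_spectrum(w, k) != has_block_witness(w, k):
                    return False
    return True

print(check_equivalence(7, 3))  # True
```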
Proposition 5
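For $k\in\mathbb{N}_{\geq 3}$ and $w\in\Sigma^{2k}_{wzb}$ we have
$w\in\{ab,ba\}^k$ if and only if
$\mathrm{ScatFact}_k(w)=\Sigma^k$.
Proof
If $w\in\{ab,ba\}^k$, then
$\{ab,ba\}^k\cap\mathrm{ScatFact}_{2k}(w)\neq\emptyset$ and the claim follows
by Theorem 3.4. On the other hand, if
$\mathrm{ScatFact}_k(w)=\Sigma^k$ then
$\{ab,ba\}^k\cap\mathrm{ScatFact}_{2k}(w)\neq\emptyset$ by Theorem 3.4, and since |w|=2k we get
$w\in\{ab,ba\}^k$. ∎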
To see why from $w\in\{ab,ba\}^k$ it follows that $\mathrm{ScatFact}_k(w)=\Sigma^k$, note that, by definition, a word $w\in\{ab,ba\}^k$ is just a concatenation of k blocks from
{ab,ba}. To construct the scattered factors of w, we can simply select from each block either the a or the b. The resulting output is a word of length k, where in each position we could choose the letter freely. Consequently, we can produce all words in $\Sigma^k$ in this way. The other implication follows by induction.
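The block-selection argument is constructive, and can be sketched as follows. The function name `embed` and the 0-based positions it returns are choices made for this illustration only. Given $w\in\{ab,ba\}^k$ and a target word x of length k, it picks the required letter from the i-th block of w.

```python
from itertools import product

def embed(x, w):
    """Return 0-based positions of w spelling x, taking one letter per block.

    Assumes w is a concatenation of len(x) blocks, each equal to 'ab' or 'ba'.
    """
    assert len(w) == 2 * len(x)
    positions = []
    for block_index, letter in enumerate(x):
        block = w[2 * block_index: 2 * block_index + 2]
        assert block in ("ab", "ba"), "w must lie in {ab,ba}^k"
        # Each block contains both letters, so the wanted one is always found.
        positions.append(2 * block_index + block.index(letter))
    return positions

w = "abbaab"  # the blocks are ab, ba, ab
for letters in product("ab", repeat=3):
    x = "".join(letters)
    positions = embed(x, w)
    assert "".join(w[p] for p in positions) == x
print(embed("bba", w))  # [1, 2, 4]
```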
Generalising Proposition 5 for weakly-c-balanced words requires a more sophisticated approach. A
generalisation would be to consider $w\in\{ab,ba\}^{k-c}a^c$. By Theorem 3.4 we have $\mathrm{ScatFact}_{k-c}(w)=\Sigma^{k-c}$. But the
size of $\mathrm{ScatFact}_{k-i}(w)$ for $i\in[c]_0$ depends on the specific choice
of w. To see why, consider the words $w_1=baabba$ and
$w_2=(ba)^3$. Then by Proposition 5,
$|\mathrm{ScatFact}_3(w_1)|=8=|\mathrm{ScatFact}_3(w_2)|$. However, when we append an a to the end of both $w_1$ and $w_2$, we see that in fact
$|\mathrm{ScatFact}_4(w_1a)|=11\neq 12=|\mathrm{ScatFact}_4(w_2a)|$.
The main difference
between weakly-0-balanced and weakly-c-balanced words for c>0, regarding the maximum
cardinality of the sets of scattered factors, comes from the role played by the factors $a^2$ and $b^2$ occurring in w.
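The four cardinalities in this example are easy to recompute. The snippet below is a brute-force check, with `k_spectrum` being a helper name introduced only for this illustration.

```python
from itertools import combinations

def k_spectrum(w, k):
    """Set of distinct length-k scattered factors of w (brute force)."""
    return {"".join(t) for t in combinations(w, k)}

w1, w2 = "baabba", "ba" * 3
print(len(k_spectrum(w1, 3)), len(k_spectrum(w2, 3)))              # 8 8
print(len(k_spectrum(w1 + "a", 4)), len(k_spectrum(w2 + "a", 4)))  # 11 12

# The only word of length 4 that w2 + "a" has but w1 + "a" lacks:
print(sorted(k_spectrum(w2 + "a", 4) - k_spectrum(w1 + "a", 4)))   # ['abab', 'bbab']
```

Both missing words require a b after two alternations, which the factor bb in $w_1$ uses up; this matches the remark on the role of $a^2$ and $b^2$.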
In the remaining
part of this section
we present a series of results for weakly-c-balanced words. Intuitively, the words with many alternations between a and b have more distinct scattered factors. So, we will focus on such words mainly.
Our first result is a direct consequence of Theorem 3.4. The second result concerns words avoiding $a^2$ and $b^2$ and gives a method to identify efficiently the ℓ-spectra of words which are prefixes of
$(ab)^\omega$, for all ℓ. Finally, we are able to derive a way to efficiently enumerate (and count) the scattered factors of length k of $(ab)^{k-c}a^c$.
Corollary 6
For $k\in\mathbb{N}_{\geq 3}$, $c\in[k]_0$, and
$w\in\Sigma^{2k-c}$ weakly-c-balanced, the
cardinality of $\mathrm{ScatFact}_{k-c}(w)$ is exactly $2^{k-c}$ if and only if
$\mathrm{ScatFact}_{2(k-c)}(w)\cap\{ab,ba\}^{k-c}\neq\emptyset$.
Proof
The claim follows directly by Theorem 3.4.∎
As announced, we further focus our investigation on the words $w=(ab)^{k-c}a^c$.
By Theorem 3.4 we have $\mathrm{ScatFact}_i(w)=\Sigma^i$ for all $i\in[k-c]_0$. For all i with k−c<i≤k, a more sophisticated counting argument is needed. Intuitively, a scattered factor of length i of $(ab)^{k-c}a^c$ consists of a part that is a scattered factor (of arbitrary length) of $(ab)^{k-c}$ followed by a (possibly empty) suffix of as.
Thus, a full description of the ℓ-spectra of words that occur as prefixes of $(ab)^\omega$, for all appropriate ℓ, is useful. To this end, we introduce the
notion of a deleting sequence: for a word w and a scattered factor u of w the deleting sequence contains (in strictly increasing order) the positions of w
that have to be deleted to obtain u.
Definition 7
For $w\in\Sigma^*$, $\sigma=(s_1,\dots,s_\ell)\in[|w|]^\ell$, with ℓ≤|w| and
$s_i<s_{i+1}$ for all $i\in[\ell-1]$, is a deleting sequence.
The scattered factor $u_\sigma$ associated to a deleting sequence σ is $u_\sigma=u_1\dots u_{\ell+1}$, where $u_1=w[1..s_1-1]$, $u_{\ell+1}=w[s_\ell+1..|w|]$, and $u_i=w[s_{i-1}+1..s_i-1]$ for 2≤i≤ℓ.
Two sequences σ, σ′ with $u_\sigma=u_{\sigma'}$ are called equivalent.
For the word w=abbaa and σ=(1,3,4) the associated scattered
factor is $u_\sigma=ba$. Since ba can also be generated by (1,3,5),
(1,2,4) and (1,2,5), these sequences are equivalent.
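Definition 7 translates directly into code. The sketch below builds $u_\sigma$ as the concatenation of the factors between consecutive deleted positions; positions are 1-based as in the text, and the function name `scattered_factor` is an assumption of this example.

```python
def scattered_factor(w, sigma):
    """Return u_sigma for a deleting sequence sigma of 1-based positions of w."""
    assert all(1 <= s <= len(w) for s in sigma)
    assert all(s < t for s, t in zip(sigma, sigma[1:])), "sigma must be strictly increasing"
    pieces = []
    previous = 0  # last deleted position so far (0 = nothing deleted yet)
    for s in sigma:
        # w[previous+1 .. s-1] in 1-based notation is w[previous:s-1] in Python.
        pieces.append(w[previous:s - 1])
        previous = s
    pieces.append(w[previous:])  # the factor after the last deleted position
    return "".join(pieces)

for sigma in [(1, 3, 4), (1, 3, 5), (1, 2, 4), (1, 2, 5)]:
    print(sigma, scattered_factor("abbaa", sigma))  # each line ends with: ba
```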
In order to determine the ℓ-spectrum of a word $w\in\Sigma^n$ for
$\ell,n\in\mathbb{N}$, we can determine how many equivalence classes the equivalence defined above has for
sequences of length k=n−ℓ. The following three lemmas characterise
the equivalence of deleting sequences.
Lemma 8
Let $w\in\Sigma^n$ be a prefix of
$(ab)^\omega$. Let $\sigma=(s_1,\dots,s_k)$ be a deleting sequence for w
such that there exists j≥2 with $s_{j-1}<s_j-1$ and $s_j+1=s_{j+1}$. Then
σ is equivalent to
$\sigma'=(s_1,\dots,s_{j-1},s_j-1,s_{j+1}-1,s_{j+2},\dots,s_k)$, i.e., σ′ is the
sequence σ where both $s_j$ and $s_{j+1}$ were decreased by 1.
Proof
Since $s_{j-1}<s_j-1$, the factor $u_\sigma$ contains the letter $w[s_j-1]$. If
$w[s_j]=a$ then $w[s_{j+1}]=w[s_j+1]=b$ and $w[s_j-1]=b$. Clearly, when
deleting $w[s_j-1]$ and
$w[s_j]$ according to the sequence σ′, the b that was corresponding
to $w[s_j-1]$ will be replaced by a letter b corresponding to $w[s_{j+1}]$,
which is not deleted. So, in the end, $u_{\sigma'}=u_\sigma$. The case
$w[s_j]=b$ is analogous. ∎
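The shift of Lemma 8 can be tried out on all deleting sequences of short prefixes of $(ab)^\omega$. The following is a sketch for experimentation; `alternating_prefix`, `delete_positions`, and `lemma8_moves` are hypothetical helper names, and sequences are tuples of 1-based positions.

```python
from itertools import combinations

def alternating_prefix(n):
    """The prefix of length n of the infinite word (ab)^omega."""
    return ("ab" * n)[:n]

def delete_positions(w, sigma):
    """Scattered factor of w obtained by deleting the 1-based positions in sigma."""
    drop = set(sigma)
    return "".join(c for pos, c in enumerate(w, start=1) if pos not in drop)

def lemma8_moves(sigma):
    """Yield every sequence obtained from sigma by one shift as in Lemma 8."""
    # idx is the 0-based index of s_j; j >= 2 means idx >= 1.
    for idx in range(1, len(sigma) - 1):
        gap_before = sigma[idx - 1] < sigma[idx] - 1
        adjacent_after = sigma[idx] + 1 == sigma[idx + 1]
        if gap_before and adjacent_after:
            yield sigma[:idx] + (sigma[idx] - 1, sigma[idx + 1] - 1) + sigma[idx + 2:]

w = alternating_prefix(6)  # ababab
sigma = (1, 4, 5)
for moved in lemma8_moves(sigma):
    print(sigma, "->", moved, delete_positions(w, sigma), delete_positions(w, moved))
# (1, 4, 5) -> (1, 3, 4) bab bab
```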
Lemma 9
Let $w\in\Sigma^n$ be a prefix of
$(ab)^\omega$. Let $\sigma=(s_1,\dots,s_k)$ be a deleting sequence for w.
Then there exists an integer j≥0 such that σ is equivalent to the
deleting sequence $(1,2,\dots,j,s'_{j+1},\dots,s'_k)$, where $s'_{j+1}>j+1$
and $s'_i>s'_{i-1}+1$ for all j<i≤k. Moreover, j≥1 if and only if σ contained two consecutive positions or σ started with 1.
Proof
Let $\sigma_0=\sigma$. For i≥0, we iteratively transform $\sigma_i$ into
$\sigma_{i+1}$ as follows: if $\sigma_i$ contains on consecutive positions the
numbers g, t, t+1, h, such that g<t−1 and h>t+2, we replace them by
g, t−1, t, h and obtain the sequence $\sigma_{i+1}$. By Lemma 8,
$\sigma_i$ is equivalent to $\sigma_{i+1}$. It is clear that in $O(n^2)$ steps
we will reach a sequence $\sigma_\ell$ which cannot be transformed anymore. We
take $\sigma'=\sigma_\ell$ and it is immediate that it will have the required
form. ∎
Lemma 10
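Let $w\in\Sigma^n$ be a prefix of
$(ab)^\omega$. Let $\sigma_1=(1,2,\dots,j_1,s'_{j_1+1},\dots,s'_k)$,
where $s'_{j_1+1}>j_1+1$ and $s'_i>s'_{i-1}+1$ for all $j_1<i\leq k$, and
$\sigma_2=(1,2,\dots,j_2,s''_{j_2+1},\dots,s''_k)$, where $s''_{j_2+1}>j_2+1$
and $s''_i>s''_{i-1}+1$ for all $j_2<i\leq k$. If $\sigma_1\neq\sigma_2$ then
$\sigma_1$ and $\sigma_2$ are not equivalent (i.e., $u_{\sigma_1}\neq u_{\sigma_2}$).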
Proof
We first consider the case $j_1=j_2$. Let ℓ be minimum such that
$s'_\ell\neq s''_\ell$. We can assume without loss of generality that $s'_\ell<s''_\ell$. Then $u_{\sigma_1}$ and $u_{\sigma_2}$ share the same prefix of
length $t=(s'_\ell-1)-(\ell-1)$. This prefix ends with $w[s'_\ell-1]$ and is
followed by $w[s'_\ell+1]$ in $u_{\sigma_1}$ and, respectively, by $w[s'_\ell]$
in $u_{\sigma_2}$. But $w[s'_\ell+1]\neq w[s'_\ell]$, so $u_{\sigma_1}\neq u_{\sigma_2}$.
Further, we consider the case when $j_1<j_2$ (the case $j_2<j_1$ is symmetric);
assume, as a convention, that $s''_{k+1}=0$ and let $d=j_2-j_1$. Clearly, $j_1$
and $j_2$ must have the same parity, or $u_{\sigma_1}$ and $u_{\sigma_2}$ would
start with different letters, so they would not be equal. Let ℓ be the
minimum integer such that $s'_\ell-j_1\neq s''_{\ell+d}-j_2$; because
$s''_{k+1}=0$ by convention, we have ℓ≤k. If both ℓ and ℓ+d
are at most k, then we get similarly to the case $j_1=j_2$ that
$u_{\sigma_1}\neq u_{\sigma_2}$. In the case when ℓ≤k<ℓ+d, then, by
length reasons, all positions $j>s'_\ell$ (so, including $s'_\ell+1$) in w
should belong to $\sigma_1$, a contradiction. This concludes our proof. ∎
Lemmas 8, 9, and 10 show that the representatives of the equivalence classes w.r.t. the equivalence relation between deleting sequences, introduced in Definition 7, are the sequences $(1,2,\dots,j,s'_{j+1},\dots,s'_k)$, where $s'_{j+1}>j+1$ and $s'_i>s'_{i-1}+1$ for all j<i≤k. For a fixed j≥1, the number of such sequences is $\binom{(n-j-1)-(k-j)+1}{k-j}=\binom{n-k}{k-j}$. For j=0, we have $\binom{(n-1)-k+1}{k}=\binom{n-k}{k}$ nonequivalent sequences (note that none starts with 1, as those are counted for j≥1). In total, we have, for a word w of length n which is a prefix of $(ab)^\omega$, exactly $\sum_{j\in[k]_0}\binom{n-k}{k-j}$ nonequivalent deleting sequences of length k, so $\sum_{j\in[k]_0}\binom{n-k}{k-j}$ different scattered factors of length n−k. In the above formula, we assume that $\binom{a}{b}=0$ when a<b.
Moreover, the distinct scattered factors of length ℓ=n−k of w can be obtained efficiently as follows. For j from 0 to k, delete the first j letters of w. For all choices of k−j positions in w[j+2..n], such that no two of these positions are consecutive, delete the letters on the respective positions. The resulting word is a member of $\mathrm{ScatFact}_\ell(w)$, and we never obtain the same word twice by this procedure.
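The representatives described by Lemmas 8, 9, and 10 can be enumerated directly. The sketch below is a straightforward, unoptimised rendering of that enumeration: a deleting sequence consists of the positions 1,…,j followed by pairwise non-consecutive positions larger than j+1. The helper names are ours, and the filtering of non-consecutive choices is done naively over all combinations.

```python
from itertools import combinations

def alternating_prefix(n):
    """The prefix of length n of the infinite word (ab)^omega."""
    return ("ab" * n)[:n]

def delete_positions(w, sigma):
    """Scattered factor of w obtained by deleting the 1-based positions in sigma."""
    drop = set(sigma)
    return "".join(c for pos, c in enumerate(w, start=1) if pos not in drop)

def canonical_deleting_sequences(n, k):
    """Yield the representatives (1,...,j,s'_{j+1},...,s'_k) for a word of length n."""
    for j in range(k + 1):
        head = tuple(range(1, j + 1))
        # Remaining k-j positions: all larger than j+1 and pairwise non-consecutive.
        for tail in combinations(range(j + 2, n + 1), k - j):
            if all(q - p > 1 for p, q in zip(tail, tail[1:])):
                yield head + tail

def spectrum_by_deletion(n, ell):
    """Scattered factors of length ell of the length-n prefix of (ab)^omega."""
    w = alternating_prefix(n)
    return [delete_positions(w, s) for s in canonical_deleting_sequences(n, n - ell)]

print(sorted(spectrum_by_deletion(5, 3)))
# ['aaa', 'aab', 'aba', 'abb', 'baa', 'bab', 'bba']
```

The function returns a list rather than a set so that one can observe that no word is produced twice.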
The next theorem follows from the above.
Theorem 3.11
Let w be a word of length n which is a prefix of $(ab)^\omega$. Then $|\mathrm{ScatFact}_\ell(w)|=\sum_{j\in[n-\ell]_0}\binom{\ell}{n-\ell-j}$.
A straightforward consequence of the above theorem is that, if ℓ≤n−ℓ then $|\mathrm{ScatFact}_\ell(w)|=2^\ell$.
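The counting formula of Theorem 3.11 can be compared against direct enumeration. The following is a small numerical check, assuming Python 3.8 or later for `math.comb`, which returns 0 when the lower index exceeds the upper one, in line with the convention stated above. The function names are hypothetical.

```python
from itertools import combinations
from math import comb

def alternating_prefix(n):
    """The prefix of length n of the infinite word (ab)^omega."""
    return ("ab" * n)[:n]

def spectrum_size_formula(n, ell):
    """Sum over j in {0,...,n-ell} of binom(ell, n-ell-j)."""
    return sum(comb(ell, n - ell - j) for j in range(n - ell + 1))

def spectrum_size_brute_force(n, ell):
    """Number of distinct length-ell scattered factors, by enumeration."""
    w = alternating_prefix(n)
    return len({"".join(t) for t in combinations(w, ell)})

for n, ell in [(4, 3), (5, 3), (8, 4), (9, 6)]:
    print(n, ell, spectrum_size_formula(n, ell), spectrum_size_brute_force(n, ell))
```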
With Theorem 3.11, we can now completely characterise the cardinality of the
ℓ-spectra of the weakly-c-balanced word $(ab)^{k-c}a^c$ for ℓ≤k.
Theorem 3.12
Let $w=(ab)^{k-c}a^c$ for $k\in\mathbb{N}$, $c\in[k]_0$. Then, for i≤k−c we have $|\mathrm{ScatFact}_i(w)|=2^i$. For k≥i>k−c we have $|\mathrm{ScatFact}_i(w)|=2^{k-c}+\sum_{j\in[(i+c)-k-1]_0}|\mathrm{ScatFact}_{i-j-1}((ab)^{k-c-1}a)|$.
Proof
We only need to show the claim for k≥i>k−c, as the other part follows immediately from Theorem 3.4.
We give a method to count the scattered factors of $w=(ab)^{k-c}a^c$. To
begin with, we have the scattered factor $a^i$. All the other scattered
factors must contain a letter b. Thus, we count separately the scattered
factors of the form $uba^j$, for each $j\in[i-1]_0$. This is equivalent to
counting in how many ways we can choose u. For each such u we will just have
to append $ba^j$ at the end to get the desired scattered factors of length i.
Thus, |u|=i−j−1. If j≥c then u should occur as a scattered factor of
$(ab)^{k-j-1}a$ (in order to be able to append $ba^j$ at its end and
still stay as a scattered factor of w), while if j<c then u should occur
as a scattered factor of $(ab)^{k-c-1}a$. In the first case, the length of
the scattered factor u we want to generate is less than half of the length of
the word $(ab)^{k-j-1}a$ from which we generate it. So, there are $2^{i-j-1}$ choices
for u. In the second case, if j≥(i+c)−k, again, the length of the
scattered factor u we want to generate is less than half of the length of the
word $(ab)^{k-c-1}a$ from which we generate it. So, there are $2^{i-j-1}$
choices for u again. Finally, if j<(i+c)−k, then i−j−1>k−c−1, and
we need Theorem 3.11 to count the choices for u. There are
$|\mathrm{ScatFact}_{i-j-1}((ab)^{k-c-1}a)|$ ways to choose u in this case.
Summing all these up, we get the result from the statement:
$$|\mathrm{ScatFact}_i(w)|=1+\sum_{j=(i+c)-k}^{i-1}2^{i-j-1}+\sum_{j=0}^{(i+c)-k-1}|\mathrm{ScatFact}_{i-j-1}((ab)^{k-c-1}a)|$$
$$=1+(2^{k-c}-1)+\sum_{j\in[(i+c)-k-1]_0}|\mathrm{ScatFact}_{i-j-1}((ab)^{k-c-1}a)|=2^{k-c}+\sum_{j\in[(i+c)-k-1]_0}|\mathrm{ScatFact}_{i-j-1}((ab)^{k-c-1}a)|.$$
This concludes our proof. ∎
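The case distinction used in the proof can be replayed numerically. The sketch below does not use any closed formula: it counts $a^i$ once and then, for every j, the admissible words u in front of the suffix $ba^j$, and compares the total with a brute-force enumeration. It assumes c<k (so that w contains at least one b); the function names are ours.

```python
from itertools import combinations

def k_spectrum(w, k):
    """Set of distinct length-k scattered factors of w (brute force)."""
    return {"".join(t) for t in combinations(w, k)}

def count_by_cases(k, c, i):
    """Count the length-i scattered factors of (ab)^(k-c) a^c for k-c < i <= k, c < k."""
    total = 1  # the word a^i
    for j in range(i):  # scattered factors of the form u b a^j
        blocks = k - j - 1 if j >= c else k - c - 1
        host = "ab" * blocks + "a"  # the word u has to be a scattered factor of
        total += len(k_spectrum(host, i - j - 1))
    return total

def count_brute_force(k, c, i):
    return len(k_spectrum("ab" * (k - c) + "a" * c, i))

for k, c, i in [(2, 1, 2), (3, 1, 3), (3, 2, 3), (4, 2, 3)]:
    print(k, c, i, count_by_cases(k, c, i), count_brute_force(k, c, i))
# 2 1 2 3 3
# 3 1 3 7 7
# 3 2 3 3 3
# 4 2 3 7 7
```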
As in the case of the scattered factors of prefixes of $(ab)^\omega$, we have a precise and efficient way to generate the scattered factors of $w=(ab)^{k-c}a^c$. For scattered factors of length i≤k−c of w, we just generate all possible words of length i. For greater i, on top of $a^i$, we generate separately the scattered factors of the form $uba^j$, for each $j\in[i-1]_0$. It is clear that, in such a word, |u|=i−j−1, and if j≥c then u must be a scattered factor of $(ab)^{k-j-1}a$, while if j<c then u must be a scattered factor of $(ab)^{k-c-1}a$. If j≥(i+c)−k then, by Theorem 3.11, u can take all $2^{i-j-1}$ possible values. For smaller values of j, we need to generate u of length i−j−1 as a scattered factor of $(ab)^{k-c-1}a$, by the method described above for prefixes of $(ab)^\omega$.
Nevertheless, Theorems 3.11 and 3.12 are useful to see that in order to determine the cardinality of the sets of scattered factors of words consisting of alternating as and bs or, respectively, of $(ab)^{k-c}a^c$, it is not necessary to generate these sets explicitly.
4 Cardinalities of k-Spectra of Weakly-0-Balanced Words
In the last section a characterisation of
the smallest and the largest k-spectra of words of a given length was presented (Theorem 3.2 and Proposition 5). In this
section the part in between will be investigated for weakly-0-balanced words (i.e. words
of length 2k with k occurrences of each letter).
As before, we shall assume that $k\in\mathbb{N}_{\geq 3}$.
In the particular case that k=3, we have already proven that the k-spectrum with minimal cardinality
has
4 elements and
that the maximal cardinality is 8. Moreover, as mentioned in
Remark 3, a
k-spectrum of cardinality 5 does not exist for weakly-0-balanced words of length 2k. The question remains
whether k-spectra of cardinalities 6 and 7 exist, and if so, for which words.
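For k=3 the question can be settled by inspecting all 20 weakly-0-balanced words of length 6. The sketch below groups them by the cardinality of their 3-spectrum and lists one representative per class under reversal and renaming (the lexicographically smallest word, as fixed in Section 2). All function names are assumptions of this example.

```python
from collections import defaultdict
from itertools import combinations

def k_spectrum(w, k):
    """Set of distinct length-k scattered factors of w (brute force)."""
    return {"".join(t) for t in combinations(w, k)}

def representative(w):
    """Lexicographically smallest word obtainable by reversal and renaming."""
    renamed = w.translate(str.maketrans("ab", "ba"))
    return min(w, w[::-1], renamed, renamed[::-1])

def cardinality_classes(k):
    """Map each occurring k-spectrum cardinality to the class representatives."""
    classes = defaultdict(set)
    for a_positions in combinations(range(2 * k), k):
        chosen = set(a_positions)
        w = "".join("a" if p in chosen else "b" for p in range(2 * k))
        classes[len(k_spectrum(w, k))].add(representative(w))
    return {size: sorted(reps) for size, reps in sorted(classes.items())}

for size, reps in cardinality_classes(3).items():
    print(size, reps)
# 4 ['aaabbb']
# 6 ['aababb', 'aabbba']
# 7 ['aabbab']
# 8 ['ababab', 'ababba', 'abbaab']
```

So for k=3 both cardinalities 6 and 7 occur, while 5 does not.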
Before showing that a k-spectrum of cardinality $2^k-1$ for
weakly-0-balanced
words of length 2k also exists for all
$k\in\mathbb{N}_{\geq 3}$, we prove that only scattered factors of the form
$b^{i+1}a^{k-i-1}$ for $i\in[k-2]_0$ (up to renaming, reversal) can be
“taken out” from the full set of possible scattered factors independently, without requiring the removal of additional scattered factors as well.
In particular, if a word of length k of another form is absent from the set
of
scattered factors
of w, then $|\mathrm{ScatFact}_k(w)|<2^k-1$ follows.
Lemma 13
If for $w\in\Sigma^{2k}_{wzb}$ there exists $u\notin\mathrm{ScatFact}_k(w)$ with
$u\notin\{b^ia^{k-i}\mid i\in[k-1]\}\cup\{a^ib^{k-i}\mid i\in[k-1]\}$, then
$|\mathrm{ScatFact}_k(w)|<2^k-1$.
Proof
Let $i\in[k-2]_0$. Consider firstly $u=b^ra^s$ for r+s=k and
$r\in[i]\cup\{k-i,\dots,k\}$ and $\Sigma^k\setminus\{u\}\supseteq\mathrm{ScatFact}_k(w)$ for a word
$w\in\Sigma^{2k}_{wzb}$.
If $b^{r+1}a^{s-1}$ is also not a scattered factor of w, the claim is
proven (in this case
two elements of $\Sigma^k$ are missing in $\mathrm{ScatFact}_k(w)$). Assume
$b^{r+1}a^{s-1}\in\mathrm{ScatFact}(w)$. This implies that (possibly intertwined)
(s−1) occurrences
of a follow (r+1) occurrences of b. Since u is not a scattered
factor of w, after
these (s−1) as only bs may occur. If $b^{r-1}a^sb$ is not a
scattered factor, the
claim is again proven and so suppose that it is one. This implies that the
(r−1) bs are
preceded by as and not by bs. This implies that $b^{r+1}a^{s-1}$ is
not a scattered
factor and that contradicts the assumption. Consider now $u=u_1b^ra^sb^tu_2$ with |u|=k
not to be a scattered factor of w for $r,s,t\in\mathbb{N}$. Following the same
arguments as before, the
claim is proven if $u_1b^{r-1}a^sb^{t+1}u_2$ is not a scattered factor
and
hence it is
assumed to be one. This implies that exactly $|u_1|_b$ bs occur before
$b^{r-1}$. This
implies that $u_1b^{r+1}a^sb^{t-1}u_2$ is not a scattered factor of w
of
length k.
Analogously it can be proven that scattered factors containing the switch from
a to b and
back to a cannot lead to the cardinality $2^k-1$. ∎
Proposition 14
For k∈N≥3 and w∈Σwzb2k, the set ScatFactk(w) has 2^k−1 elements if and only if
w∈{(ab)^i a^2 b^2 (ab)^{k−i−2} ∣ i∈[k−2]0} (up to renaming and reversal). In particular
ScatFactk(w)=Σ^k\{b^{i+1}a^{k−i−1}} holds for
w=(ab)^i a^2 b^2 (ab)^{k−i−2} with i∈[k−2]0.
Proof
Let i∈[k−2]0. First “⇐” will be proven, and for that consider w=(ab)^i a^2 b^2 (ab)^{k−i−2}. By
Lemma 5 follows
[TABLE]
With ScatFact2(a^2b^2)={aa,ab,bb} the k-spectrum of w has at least 3⋅2^i⋅2^{k−i−2}=3⋅2^{k−2}=2^k−2^{k−2} elements. Notice that by this construction, scattered factors with a ba at the middle position cannot be reached. For this reason we have to look at the remaining scattered factors of w that are not obtained by the above construction. This means that not only i letters may be taken from the first part and not only k−i−2 letters from the last part.
Taking a closer look at (ab)^i, one can notice that all binary numbers (encoded by a,b) of length i are scattered factors of (ab)^{i−1}a. Appending a b to these scattered factors implies that nearly all binary numbers are in the (i+1)-spectrum of (ab)^i. Appending now an a from the middle part and then each of the words from the last part leads to nearly all remaining scattered factors of the k-spectrum of w.
The only missing word is b^{i+1}, since the last b cannot be reached within the first part. This implies that the word b^{i+1}a^{k−i−1} is not in the k-spectrum of w, since with the (i+1)th b the middle part is reached and the last part contains only k−i−2 as.
This concludes ∣ScatFactk(w)∣=2^k−1.
On the other hand, if ∣ScatFactk(w)∣=2^k−1, an element of the form b^{i+1}a^{k−i−1} for an i∈[k−2]0 is missing in the k-spectrum of w. Moreover this is the only element missing. Fix an i∈[k−2]0 and set u=b^{i+1}a^{k−i−1}. The proof is very technical and excludes step by step all possibilities other than w being (ab)^i a^2 b^2 (ab)^{k−i−2}. Firstly consider i=k−2. This implies u=b^{k−1}a.
In this case w has to end in b^2 but not in b^3, since otherwise b^{k−2}a^2 would not be a scattered factor. If w were of the form w1bab^2, ∣w1∣_a=k−1 and ∣w1∣_b=k−3 would hold, which would imply that b^{k−2}a^2 is not a scattered factor. If w ended in a^3b^2, a^{k−2}ba would be excluded. Hence, w ends in a^2b^2.
Suppose at last that w=(ab)^ℓ a^2 b^2 w2 holds for ℓ<k−2 and w2∈Σ∗. Then w2 contains (k−ℓ−2) occurrences each of a and b. Thus b^{ℓ+1}a^{k−ℓ−1} is not a scattered factor of length k. This proves that for i=k−2, w=(ab)^{k−2}a^2b^2 is implied by b^{k−1}a being the only excluded scattered factor from Σ^k. Hence assume i∈[k−3]0.
Supposition: w ends in b^ℓ for ℓ≥2
If i<k−2 holds, then b^{k−1}a∉ScatFactk(w) follows (the last a of w is followed by at least two bs), and since i+1<k−1 holds, this element is different from u.
In the next step it will be shown that exactly k−i−2 repetitions of ab
are a suffix of
w.
Supposition: w=w1b2(ab)ℓ
If ℓ>k−i−2 held, bi+1ak−i−1 would not be a scattered factor of
w. If
ℓ<k−i−2 held, bk−ℓ−1aℓ+1 would not be a scattered factor
since w1 has
(k−1) a and (k−ℓ−2) b.
Supposition: w=w1a2(ba)ℓb
In this case ∣w1∣a=k−2−ℓ and ∣w1∣b=k−ℓ−1 holds. This implies
that
ak−2−ℓbℓ+1a is not in the k-spectrum of w.
Consequently there exists a w1 such that w=w1b2(ab)k−i−2 holds.
In the next it
will
be shown that b2 has to be preceded by a2.
Supposition: w=w1b3(ab)k−i−2
Here w1 has (i+2) a and (i−1) b and hence
biak−i−2b2
is not a
scattered
factor of length k of w.
Supposition: w=w1bab2(ab)k−i−2
This implies ai+2babk−i∈ScatFactk(w) since w1 has
i+1 occurrences of
a and i−1 occurrences of b.
This proves that a^2b^2(ab)^{k−i−2} is a suffix of w. The case that this is preceded by another a is excluded, since then a^i b a^{k−i−1} would not be in the k-spectrum of w.
In the last step it will be shown that the first occurrence of a2 is at
the
point 2ℓ.
Supposition: w=(ab)^ℓ a^2 w2 for ℓ≠i
If ℓ is smaller than i, ∣w2∣_a=k−ℓ−2 and ∣w2∣_b=k−ℓ hold and b^{ℓ+1}a^{k−ℓ−1}∉ScatFactk(w) follows. If ℓ is greater than i, in contradiction to the main assumption b^{i+1}a^{k−i−1} is a scattered factor, because b^{i+1} is a scattered factor of (ab)^ℓ and k−ℓ+ℓ−(i+1)=k−i−1 occurrences of a are left in the rest of w.
Combining w=(ab)ia2w2 and w=w1a2b2(ab)k−i−2 the
claim
that w is of the
form (ab)ia2b2(ab)k−i−2 is proven.∎
By Proposition 14 we get that 7 is a possible cardinality of the set of scattered factors of length 3 of
weakly-0-balanced words of length 6 and, moreover, that exactly the words
a^2b^2ab and aba^2b^2 (and symmetric words obtained by
reversal and renaming) have seven different scattered factors.
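The statement of Proposition 14 can be checked by brute force for small k. The following Python sketch is an illustration only and not part of the original development; the helper names `scat_fact`, `prop14_word` and `missing_words` are hypothetical, and full enumeration is feasible only for small k.

```python
from itertools import combinations, product

def scat_fact(w, k):
    """k-spectrum of w: the set of distinct scattered factors of length k."""
    return {"".join(c) for c in combinations(w, k)}

def prop14_word(k, i):
    """Build (ab)^i a^2 b^2 (ab)^(k-i-2) for 0 <= i <= k-2."""
    return "ab" * i + "aabb" + "ab" * (k - i - 2)

def missing_words(w, k):
    """Words of length k over {a, b} that are not scattered factors of w."""
    all_words = {"".join(p) for p in product("ab", repeat=k)}
    return all_words - scat_fact(w, k)

if __name__ == "__main__":
    k = 3
    for i in range(k - 1):
        w = prop14_word(k, i)
        # Print the word, the size of its k-spectrum and the absent words.
        print(w, len(scat_fact(w, k)), sorted(missing_words(w, k)))
```

For k=3 the two words are aabbab and abaabb; the single absent word is baa for the first and bba for the second, matching b^{i+1}a^{k−i−1} for i=0 and i=1.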
The following theorem demonstrates that there always exists a weakly-0-balanced
word w of length 2k such that ∣ScatFactk(w)∣=2k. Thus, for the case k=3 also the
question whether six is a possible cardinality of ScatFact3(w) can be answered positively.
Theorem 4.15
The k-spectrum of a word w∈Σwzb2k has exactly 2k elements if and only if
w∈{a^{k−1}bab^{k−1}, a^{k−1}b^k a} holds (up to renaming and reversal). Moreover,
there does not exist a weakly-0-balanced word w∈Σwzb2k with a k-spectrum of
cardinality 2k−i for i∈[k−2].
Proof
Consider first w=a^{k−1}bab^{k−1}. Since the k-spectrum of a^kb^k is a subset of the k-spectrum of w, the k-spectrum of w has at least k+1 elements.
Additionally w has the scattered factors of the form a^i b a b^{k−2−i}, of which there are k−1. Hence
∣ScatFactk(w)∣=k+1+k−1=2k holds. Moreover a^{k−1}b^k a has all elements of the k-spectrum of a^kb^k as scattered factors. Here the word additionally has all words of the form a^i b^{k−1−i} a as scattered factors, of which there are k−1 as well.
This proves that both words have a scattered factor set of cardinality 2k.
The other direction will be proven by contraposition following the two main
cases
[TABLE]
Assume first w=aℓbx for ℓ∈[k−2]≥2. Notice that it
does not have to be
considered that the word starts with one a, since this is
symmetric to the reversal of the case ak−1bka. This implies
∣x∣a=k−ℓ and
∣x∣b=k−1. Notice here k−ℓ<k−1. Thus, there exists a scattered factor
x′ of x of
length
2(k−ℓ) with ∣x′∣a=∣x′∣b=k−ℓ. By Lemma 3.2 follows
[TABLE]
and ∣ScatFactk−ℓ(y)∣>k−ℓ+1 otherwise.
This implies that the (k−ℓ)-spectrum of x′ is minimal with respect
to cardinality if x′ is either
ak−ℓbk−ℓ or bk−ℓak−ℓ. For giving a lower
bound of the
cardinality of w’s scattered factor set of length k, it is sufficient to
only take these both
options into consideration. This implies that it is not necessary to examine the
cases where
x contains other scattered factors with both k−ℓ a and b.
case 1: x′=ak−ℓbk−ℓ
Thus x contains ℓ−1 b which are not in x′.
case a: x=bℓ−1ak−ℓbk−ℓ
In this case w=aℓbℓak−ℓbk−ℓ holds and
that the k-spectrum of akbk is a subset of ScatFactk(w) follows.
case i: ℓ<k−ℓ
For all s∈[ℓ] the words aℓ−sbsak−ℓ,…,aℓbsak−ℓ−s
are well-defined and sum up to s+1. Moreover for every s2∈[k−ℓ] exists
r1∈N0
and exist r2,s2∈N such that the words
ar1bs1ar2bs2
with s1+r1+s2+r2=k are all distinct and distinct to the aforementioned.
Thus, in this case
[TABLE]
is a lower bound for ScatFactk(w).
case ii: ℓ>k−ℓ
Consider here for r∈[k−ℓ] the words
bℓ−rarbk−ℓ,…,bℓarbk−ℓ−r. For fixed
r these are r+1.
Moreover in this case for all r1∈[ℓ] exist s1,r2∈N and s2∈N
such that the
words
ar1bs1ar2bs2 with s1+r1+s2+r2=l are all distinct
and distinct
to the aforementioned. In total this sums up to
[TABLE]
different scattered factors.
case b: x=ak−ℓbk−1
Thus, w=aℓbak−ℓbk−1 holds. Here it holds as well that
the k-spectrum of
akbk is a subset of ScatFactk(w). Moreover all words of the
form barbs for r+s=k−1 and r∈[k−ℓ] are different scattered
factors, i.e.
k−ℓ many. Additionally the words arbabs for r+s=k−2 and
r,s>0 are different
scattered factors and distinct to the aforementioned. This sums up to
k+1+k−1+k−2=3k−2 for the
cardinality of ScatFactk(w). This proves the claim for k≥3.
case 2: x′=bk−ℓak−ℓ
Consequently
x∈{bk−1ak−ℓ,bk−ℓak−ℓbℓ−1} holds.
case a: x=bk−1ak−ℓ
Hence w=aℓbkak−ℓ. Here only ℓ+1 different scattered
factors are of the
form arbs exist and k−ℓ of the form bsar with r+s=k
(notice that the latter
ones are only k−ℓ since among all of them one is in common with the first
ones). Finally
consider the words of the form ar1bsar2 with r1+r2+s=k and
r1,r2,s>0. This
sums up to ℓ+1+k−ℓ+k. By ak∈ScatFactk(w), ∣ScatFactk(w)∣≥2k+2 follows.
case b: x=bk−ℓak−ℓbℓ−1
In this case w=aℓbk−ℓ+1ak−ℓbℓ−1 holds. Here
the cardinality
of the k-spectrum of w is determined analogously to case 1a.∎
By Proposition 14 and Theorem 4.15 the possible cardinalities of ScatFact3(w) for weakly-0-balanced words w of length 6 are completely
characterized. Theorem 4.15 determines the first gap in the set of cardinalities of ∣ScatFactk(w)∣ for w∈Σwzb2k: there does not exist
a word w∈Σwzb2k with ∣ScatFactk(w)∣=k+i+1 for i∈[k−2] and
k≥3, since all
words that are not of the form akbk, bkak,ak−1babk−1, or
ak−1bka have a scattered factor set of cardinality at least
2k+1. As the size of this first gap is linear in k, it is clear that the larger k is, the
more unlikely it is to find a k-spectrum of a small cardinality.
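For small k, the set of attained cardinalities can be tabulated by exhaustive enumeration. The sketch below is an illustration under the assumption that brute force over all words with k occurrences of each letter is acceptable; the function names `balanced_words` and `cardinalities` are hypothetical and do not come from the text.

```python
from itertools import combinations

def scat_fact(w, k):
    """k-spectrum of w: the set of distinct scattered factors of length k."""
    return {"".join(c) for c in combinations(w, k)}

def balanced_words(k):
    """All words of length 2k over {a, b} with exactly k occurrences of each letter."""
    words = []
    for a_positions in combinations(range(2 * k), k):
        letters = ["b"] * (2 * k)
        for p in a_positions:
            letters[p] = "a"
        words.append("".join(letters))
    return words

def cardinalities(k):
    """Map each attained value |ScatFact_k(w)| to the sorted list of words attaining it."""
    table = {}
    for w in balanced_words(k):
        table.setdefault(len(scat_fact(w, k)), []).append(w)
    return {n: sorted(ws) for n, ws in table.items()}

if __name__ == "__main__":
    for n, ws in sorted(cardinalities(3).items()):
        print(n, ws)
```

For k=3 the attained values are 4, 6, 7 and 8, so 5 is the only value between the minimum k+1 and the maximum 2^k that does not occur.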
In the following we will prove that the cardinalities 2k+1 up to 3k−4 are
not reachable, i.e. 3k−3 is the third smallest cardinality after k+1 and
2k (witnessed by, e.g., a^{k−2}b^k a^2).
Lemma 16
For i∈[⌊k/2⌋] and j∈[k−1]:
∣ScatFactk(a^{k−i}b^k a^i)∣=k(i+1)−i^2+1 for k≥4,
∣ScatFactk(a^{k−1}b^2ab^{k−2})∣=3k−2,
∣ScatFactk(a^{k−2}b^j a b^{k−j} a)∣=k(2j+2)−6j+2 for k≥5, and
∣ScatFactk(a^{k−2}b^j a^2 b^{k−j})∣=k(2j+1)−4j+2.
Proof
For the first claim, let i∈[⌊k/2⌋]≥2.
The k-spectrum of a^{k−i}b^k a^i contains exactly all words of the form a^r b^s a^t
with r+s+t=k, t∈[i]0, r∈[k−i]0, and s∈[k]0. If t and r are fixed, s is
uniquely determined. Since all these scattered factors are different, the k-spectrum has
(i+1)(k−i+1)=k(i+1)−i^2+1 elements. Thus the first claim is proven.
For the second claim, notice that the scattered factors of a^{k−1}b^2ab^{k−2} are of four different forms:
b^r a b^t, a^r b^s a, a^r b^s, and a^r b^{s1} a b^{s2}. Notice that all these scattered
factors are different if in the second one s is chosen greater than or equal to 1 and in the
last one r,s1,s2≥1 holds. The first and second form lead to two scattered factors each, since
for every s∈[2] there are enough a at the beginning for padding from the left.
The third form leads to k+1 different scattered factors, as shown in Theorem 3.2. The last one is a little more complicated.
Notice firstly that r is at most k−3 since s1,s2>0 holds. In this case there exists only
one possibility for choosing s1 and s2, namely both equal to 1. If r is k−4, there exist two
possibilities, namely s1=1 and s2=2 or vice versa. For r∈[k−5] there always exist 2
possibilities for the bs between the as. This leads to 2(k−5) possibilities. Overall
this sums up to 2+2+k+1+1+2+2(k−5)=8+3k−10=3k−2.
As in the proof of the second part, for the remaining parts the scattered factors can be
categorized in the forms
a^r b^s, b^{r1}a^s b^{r2}, a^{r1}b^s a^{r2}, and
a^{r1}b^{s1}a^{r2}b^{s2}, where with appropriately chosen exponents no factor is
counted twice. Also as before, i can be chosen in [⌊k/2⌋], since
otherwise the proof is analogous for k−i. The first form contributes k+1
elements. The second and third form contribute 2i each, since s resp. r2 range in [2]. For
the last form a distinction is necessary. If r=k−3 holds, a^{k−3}bab is the only
scattered factor. If r is smaller than k−3, 2i possibilities for each r∈[k−3] lead to
scattered factors.
Overall this sums up to k+1+2i+2i+1+2i(k−4)=k(2i+1)−4i+2. By this the claim with
cardinality k(2i+1)−4i+2 is proven.
For the remaining claim, with cardinality k(2i+2)−6i+2, scattered factors of different forms will again be
distinguished. Since also here the minimal k-spectrum is a subset of the k-spectrum of w, these
k+1 elements count towards the cardinality. There exist i scattered factors of the form
a^r b^s a^2 and k−2 of the form a^r b^s a, since with the last a all occurrences
of b are before it. Assuming w.l.o.g. again that i is at most k/2, only
b^{k−1}a is a scattered factor of the form b^s a^r. The scattered factors of the form
b^{r1}ab^{r2}a contribute i many. The remaining two forms again need a case analysis.
There exists exactly one scattered factor of the form a^r b^{s1}ab^{s2} for r=k−3
and exactly one scattered factor of the form a^{r1}b^{s1}ab^{s2}a for r1=k−4. If
r resp. r1 are smaller, there exist i different scattered factors for each choice of
r∈[k−4] resp. r1∈[k−5]. This sums up to
k+1+k−2+i+i+1+i+1+i(k−5)+1+i(k−4)=2k+2+3i+ik−5i+ik−4i=k(2+2i)−6i+2. ∎
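The first formula of Lemma 16, and the square numbers it produces for i=k/2, can be compared against direct enumeration. This is a minimal sketch, assuming exhaustive generation of all length-k subsequences is affordable; `lemma16_first` and `brute_force` are hypothetical helper names introduced only for this illustration.

```python
from itertools import combinations

def scat_fact(w, k):
    """k-spectrum of w: the set of distinct scattered factors of length k."""
    return {"".join(c) for c in combinations(w, k)}

def lemma16_first(k, i):
    """Closed form k(i+1) - i^2 + 1, which factors as (i+1)(k-i+1)."""
    return k * (i + 1) - i * i + 1

def brute_force(k, i):
    """Size of the k-spectrum of a^(k-i) b^k a^i, computed by enumeration."""
    w = "a" * (k - i) + "b" * k + "a" * i
    return len(scat_fact(w, k))

if __name__ == "__main__":
    for k in range(4, 8):
        for i in range(1, k // 2 + 1):
            print(k, i, lemma16_first(k, i), brute_force(k, i))
```

The factorisation (i+1)(k−i+1) reflects the counting argument of the proof: the exponent t of the trailing block ranges over i+1 values and the exponent r of the leading block over k−i+1 values.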
Notice that for i∈[⌊k/2⌋] the sequence (k(2i+1)−4i+2)_i is
increasing and its minimum is 3k−2, while for i∈[⌊k/2⌋] the
sequence (k(2i+2)−6i+2)_i is
increasing and its minimum is 4k−4. The following lemma only gives lower
bounds for specific forms of words,
since, on the one hand, this proves to be sufficient for Theorem 4.18, which describes the second gap, and, on
the other hand, the proofs show that the formulas describing the exact number of scattered
factors of a specific form are getting more and more complicated. It has to be
shown that also words starting with i letters a, for i∈[k−3],
have a k-spectrum of greater
(as lower is already excluded) cardinality. By
Lemma 16 only words with another transition from a’s
to b’s need to be considered (w=a^{r1}b^{s1}w1a^{r2}b^{s2}).
W.l.o.g. we can assume s1 to be maximal, such that w1 starts with an a,
and similarly, by maximality of r2, ends with a b; thus only words of the form
a^{r1}b^{s1}…a^{rn}b^{sn} have to be considered, and by
Proposition 5, it is sufficient to investigate n<k.
Lemma 17
∣ScatFactk(a^{k−2}b^i a b^j a b^{k−i−j})∣≥3k−3 for i,j∈[k−2], i+j≤k−1,
∣ScatFactk(a^{k−2}b^{s1}a^{r1}b^{s2}a^{r2}b^{s3})∣≥3k−4 for s1+s2+s3=k, r1+r2=2, s1>0, r1,r2,s2,s3≥0,
∣ScatFactk(a^{r1}b^{s1}…a^{rn}b^{sn})∣≥3k−3 for r1≤k−3,
∑i∈[n]ri=∑i∈[n]si=k, and ri,si≥1.
Proof
For the first claim, choose i,j∈[k−2] and set wij=a^{k−2}b^i a b^j a b^{k−i−j}. Then all words of the form a^r b^s for r,s∈[k]0 with r+s=k
are scattered factors of wij, and by Lemma 3.2 it follows that wij has k+1 scattered
factors of this form. Scattered factors of the form a^{r1}b^s a^{r2}
can occur in three
variants. In the first variant only the second block of a is involved after
the first block of
b, namely the second single a is not involved. Since i∈[k−2] holds,
for each s∈[i]
exists r1,r2 (r2=1) such that ar1bsar2 is a scattered
factor of wij,
i.e. wij has additionally i scattered factors. The second variant uses
the a of each of the second and the third a-block. Thus only scattered factors of the form
a^{r1}b^s a^{r2} are of interest; the second b-block is not involved. If i+j=k−1 holds, only
i−1 scattered factors of this form occur, otherwise again i new elements are in the
k-spectrum. If only the a from the third block is involved, then j (resp. j−1) new elements are
in the spectrum.
This sums up to at least 2i+j−2 elements of the form
a^{r1}b^s a^{r2}. A similar
distinction leads to the number of scattered factors of the form
ar1bs1ar2bs2. Assume first r2=1 and for this only
the a from the
second a-block. This implies that either only b from the second block or
from the second
and third block can be taken for the last b-block in the scattered factor.
Moreover
r1,s1,s2 are at most k−3. For each choice of r1 in [k−3] there are
min{j,k−2−i}
possibilities, which leads to
[TABLE]
If b from the second and third block are allowed, all of the second block
have to occur for
obtaining different scattered factors to the previous ones. Thus,
[TABLE]
If both the second and the third a-block are involved,
ik−(1/2)i^2−ij−(1/2)i
additional scattered factors are in the k-spectrum. This all sums up to
[TABLE]
Since either i2≥ij or j2≥ij and i,j∈[k−3] hold, this is
greater than or equal to
[TABLE]
Notice that additionally there exist scattered factors of other forms, which
enlarge the concrete
k-spectrum.
For the second claim, consider first the case, when s2=0, r1=0, or r2=0. This leads to words
of the form
matching Lemma 16 and consequently the k-spectrum has
k(2i+1)−4i+2≥3k−2>3k−4
elements. Consider now the case that s3=0 holds and all other exponents are
at least 1. By
Lemma 16 follows again that each such word has at least
k(2i+2)−6i+2≥4k−4>3k−4
elements.
Finally by Lemma 17 follows that the remaining words of the given
form have at least
3k−3 scattered factors.
Finally notice that ak is a scattered factor and ak−ibi for sn also.
Notice here, that
the proof leads to sn−1 scattered factors, if in the claim sn=0 would be
allowed. Consider
now the scattered factors of the form aibj for i,j∈[k]. Let m be
the number of the
block in which the ith a occurs. If sm+⋯+sn≥k−i holds,
aibk−i is
a scattered factor of w. Consider the opposite. This implies that from the
mth till the
nth block fewer than k−i b occur. Thus in the blocks 1 to i there
occur more than i
b. Since the ith a is in the mth block, from this point till the
end there are
k−i a. Hence biak−i is a scattered factor of w. So in each
case at least one
scattered factor occurs, i.e. at least k+1 scattered factors of this form are
in the
k-spectrum.
Notice here, that the argument holds still if sm=0 is allowed. With a similar
argumentation the
number of occurrences of the form aibjak−i−j will be shown. If
for a specific
i,j-combination aibjak−i−j is not a scattered factor, then choose
m1,m2 such
that the ith a is in block m1 and the jth b after that is in
block m2. Thus
in the blocks m2+1 to n are less than k−i−j a. Let r′_{m1} be the number of
a in the m1th block which do not belong to a^i. Then r′_{m1}+⋯+r_{m2}
contains more than
k−j a since k−j−i a occur in the m1th to the nth block.
Thus
a^{r′_{m1}}b^{s_{m1}}…a^{r_{m2}}b^{s′_{m2}} is a scattered
factor of length at
least k+1, where s′_{m2} describes the part of the m2th block until
the jth b.
If 1<m1,m2<n holds, bak−j−1bj−2 is a scattered factor of w.
If m1=m2=1
holds, ak−j−3bab is a scattered factor. If both are equal to n,
bak−j−1bj−2 is a scattered factor. In both cases the last b
exist even if
sm=0
holds, since the scattered factor ends in the examined block m2. If m1<m2
holds, there
exists a factor of length >k which can be narrowed to a factor starting in
a, ending in
b, and having at least one switch from b back to a and back to
b. This
concludes to at least (k−2)2 scattered factors of the form
aibjak−i−j (or a
different one in exchange). By k2−k+3≥3k−3 for k≥5 follows the
claim.∎
By Lemmas 16 and 17 we are able to prove the
following theorem, which
shows the second gap in the set of cardinalities of ScatFactk for words in Σwzb2k.
Theorem 4.18
For k≥5 there does not exist a word w∈Σwzb2k with
k-spectrum of cardinality
2k+i for i∈[k−4]. In other words, there is a cardinality gap between 2k+1 and 3k−4.
Proof
Theorems 3.2 and 4.15 show that exactly the words
a^kb^k, a^{k−1}bab^{k−1}, and a^{k−1}b^ka have k-spectra of
cardinality less than or equal to 2k. By Lemmas 16 and 17 it follows that
a^{k−2}b^ka^2 has a k-spectrum of cardinality 3k−3. Assume
w∈Σwzb2k\{a^kb^k, a^{k−1}bab^{k−1}, a^{k−1}b^ka, a^{k−2}b^ka^2}. Since renaming and reversal do not influence the cardinality, it
can be assumed w.l.o.g. that w starts with a. By assumption w does not start with
a^k. If w starts with a^{k−1}, w=a^{k−1}b^iab^{k−i} follows with i∈[k−1]≥2 and by
Lemma 16 the k-spectrum has (i+1)k−4i+6≥3k−2>3k−4 elements.
By Lemma 17 the claim follows for words starting with (k−2) a,
and it is also shown there that words starting with at least two and at most
k−3 a lead to k-spectra of cardinality at least 3k−3.∎
Going further, we analyse the larger possible cardinalities of ScatFactk, trying to see what values are achievable (even if only asymptotically, in some cases).
Corollary 19
All square numbers greater than or equal to four occur as the cardinality of the k-spectrum of a word w∈Σwzb2k;
in particular
∣ScatFactk(a^{k/2}b^ka^{k/2})∣=(k/2+1)^2 holds for k even.
Proof
Apply Lemma 16 to i=k/2. This implies that the cardinality of
the k-spectrum of a^{k/2}b^ka^{k/2} is
k(k/2+1)−(k/2)^2+1 = k^2/4+k+1 = (k/2+1)^2.
Inspired by the previous Corollary, we can show the following result concerning the asymptotic behaviour of the cardinality of ScatFactk for words of length 2k.
Proposition 20
Let i>1 be a fixed (constant) integer. Let d=⌊k/i⌋ and r=k−di, and
d′=⌊k/(i−1)⌋ and r′=k−d′(i−1). Then the following hold:
the word a^rb^r(a^db^d)^i has Θ(k^{2i−1}) scattered factors of length k;
the word a^rb^{r′}(a^db^{d′})^{i−1}a^d has Θ(k^{2i−2}) scattered factors of length k.
Proof
Let us first show the upper bounds. The following algorithm can be used to find
the scattered factors of length k of a^rb^r(a^db^d)^i. Choose
2 numbers q1 and q2 from [i]0, and 2i−1 integers
r1,…,r2i−1 from [d]0. Let r2i=k−(q1+q2+∑j∈[2i−1]rj). If r2i≥0 then the word
[TABLE]
is a scattered factor of a^rb^r(a^db^d)^i, and all scattered factors of length k of this word have this form.
From the construction of w′, because d≤k/i, it follows that
there are at most O(i^2k^{2i−1}) possible ways to obtain it. As i is seen as
a constant, this means that a^rb^r(a^db^d)^i has O(k^{2i−1})
scattered factors of length k.
In the same way one can show that a^rb^{r′}(a^db^{d′})^{i−1}a^d
has O(k^{2i−2}) scattered factors of length k.
Let us now show the lower bounds. We first consider the word
a^rb^r(a^db^d)^i. As i is constant, let us assume that
k>i(2i−1)/(i−1). Clearly, k(i−1)/(i(2i−1))<k/(2i−1)≤k/i−1≤d≤k/i and d+r≥k/i. We generate
scattered factors of the word a^rb^r(a^db^d)^i as follows. We
firstly choose 2i−1 integers r1,…,r2i−1 between
k(i−1)/(i(2i−1)) and k/(2i−1). Under our assumptions, the word
[TABLE]
is a
scattered factor of the suffix b^d(a^db^d)^{i−1} of
a^rb^r(a^db^d)^i. Let r0=k−∑j∈[2i−1]rj. We have
r0≤k/i≤d+r, so a^{r0}w′′ is a scattered factor of
a^r(a^db^d)^i, so also of a^rb^r(a^db^d)^i. Moreover,
each choice of a tuple (r1,…,r2i−1) leads to a different scattered
factor of a^rb^r(a^db^d)^i. The total number of tuples we choose
is
[TABLE]
So the total number of scattered factors of length k of a^rb^r(a^db^d)^i is at least (k/(i(2i−1)))^{2i−1}. As the total
number of scattered factors of length k of a^rb^r(a^db^d)^i is
also O(k^{2i−1}), we get that a^rb^r(a^db^d)^i has
Θ(k^{2i−1}) scattered factors of length k.
The proof that a^rb^{r′}(a^db^{d′})^{i−1}a^d has
Θ(k^{2i−2}) scattered factors of length k follows in a very similar
manner.
∎
Remark 21
Let i be an integer, and consider k another integer divisible by i.
Consider the word wk=(a^{k/i}b^{k/i})^i. The number of
scattered factors of length k of wk is at most the number
C(k,2i,k/i) of weak 2i-compositions of k whose terms
are bounded by k/i, i.e., the number of ways in which k can be
written as a sum ∑j∈[2i]rj where rj∈[k/i]0 (every scattered factor arises from such a composition, although compositions with zero terms may describe the same word; for instance (ab)^2 has 4 scattered factors of length 2, while C(2,4,1)=6). From Proposition 20 we get that the number of scattered factors is Θ(k^{2i−1}), so C(k,2i,k/i) grows at least this fast, and we also have:
[TABLE]
for M=i(k+2i−1)/(k+i).
It is known that there exists a constant E>0 such that
[TABLE]
The coefficient of k^{2i−1} in the right hand side of this inequality has to
be positive. Consequently ∑0≤j<M (−1)^j (2i choose j)(i−j)^{2i−1}>0.
This seems to be an interesting combinatorial inequality in itself.
One can also show as in Proposition 20 that the number of scattered factors of length k of wk which, in turn,
have (ab)^i as a scattered factor is Θ(k^{2i−1}). This
number also equals the number C′(k,2i,k/i) of 2i-compositions
of k whose terms are strictly positive integers upper bounded by
k/i, i.e., the number of ways in which k can be written as a sum
∑j∈[2i]rj where rj∈[k/i]. Just as above,
from this we get ∑0≤j<i (−1)^j (2i choose j)(i−j)^{2i−1}>0.
Again, this inequality seems interesting to us.
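The displayed formula for the number of bounded compositions is lost in this copy of the text. As an illustration, the sketch below counts bounded weak compositions with the standard inclusion-exclusion identity, which is an assumption about what the display contained and may differ in form from it. It also checks the second count of the remark, namely that the scattered factors of wk containing (ab)^i correspond to compositions with strictly positive bounded terms. All function names are hypothetical.

```python
from itertools import combinations, product
from math import comb

def bounded_weak_compositions(k, parts, bound):
    """Number of tuples (r_1, ..., r_parts), 0 <= r_j <= bound, summing to k.

    Standard inclusion-exclusion over the terms exceeding the bound; parts >= 1.
    """
    total, j = 0, 0
    while j <= parts and k - j * (bound + 1) >= 0:
        total += (-1) ** j * comb(parts, j) * comb(k - j * (bound + 1) + parts - 1, parts - 1)
        j += 1
    return total

def brute_force_compositions(k, parts, bound):
    """Same count by direct enumeration, for cross-checking small cases."""
    return sum(1 for t in product(range(bound + 1), repeat=parts) if sum(t) == k)

def is_scattered_factor(u, w):
    """True if u can be obtained from w by deleting letters."""
    it = iter(w)
    return all(ch in it for ch in u)

def factors_containing_abi(k, i):
    """Number of length-k scattered factors of (a^(k/i) b^(k/i))^i that contain (ab)^i."""
    w = ("a" * (k // i) + "b" * (k // i)) * i
    spectrum = {"".join(c) for c in combinations(w, k)}
    return sum(1 for u in spectrum if is_scattered_factor("ab" * i, u))
```

A composition of k into 2i strictly positive terms bounded by k/i is the same as a weak composition of k−2i into 2i terms bounded by k/i−1, which is how the second count can be evaluated with `bounded_weak_compositions(k - 2*i, 2*i, k//i - 1)`.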
We will end this analysis with the conjecture that, in contrast to the first gap,
which always starts
immediately after the first obtainable cardinality, the last gap ends earlier
the larger k is.
More precisely, if w=a^2b^2(ab)^{k−3−i}ba(ab)^i for
k∈N≥4, i∈[k−2]0 then ∣ScatFactk(w)∣=2^k−2−i.
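The conjectured family can be explored numerically for small k. The sketch below is only an experiment, not evidence for the general statement; it restricts i to 0≤i≤k−3 so that the exponent k−3−i is non-negative, which is an assumption of this illustration, and the names `last_gap_word` and `check_conjecture` are hypothetical.

```python
from itertools import combinations

def scat_fact(w, k):
    """k-spectrum of w: the set of distinct scattered factors of length k."""
    return {"".join(c) for c in combinations(w, k)}

def last_gap_word(k, i):
    """Build a^2 b^2 (ab)^(k-3-i) ba (ab)^i for 0 <= i <= k-3."""
    return "aabb" + "ab" * (k - 3 - i) + "ba" + "ab" * i

def check_conjecture(k):
    """Return triples (i, conjectured value 2^k - 2 - i, computed cardinality)."""
    rows = []
    for i in range(k - 2):
        w = last_gap_word(k, i)
        rows.append((i, 2 ** k - 2 - i, len(scat_fact(w, k))))
    return rows

if __name__ == "__main__":
    for k in range(4, 7):
        print(k, check_conjecture(k))
```

For k=4 the two words are aabbabba and aabbbaab, with 14 and 13 scattered factors of length 4 respectively, in agreement with the conjectured values.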
At the end of this section, we will briefly introduce θ-palindromes in
this specific
setting. Let θ:Σ∗→Σ∗ be an antimorphic
involution, i.e.
θ(uv)=θ(v)θ(u) and θ2 is the identity on
Σ∗. By Σ={a,b} only the identity and renaming are
such mappings. The fixed points of θ are called θ-palindromes
(ab3.θ(b)3θ(a)) and exactly the words where
wR=w holds. They
have been well studied in different fields (see e.g.,
[5],
[8]). A word w∈Σwzb2k is a
θ-palindrome
iff either w∈{aw′b,bw′a} for some θ-palindrome
w′∈Σwzb2(k−1)
or additionally w=a^{k/2}b^ka^{k/2} in the case that k
is even. Two
cardinality results for θ-palindromes are presented in
Lemma 16 and
Corollary 19. We believe that pursuing the k-spectra of
θ-palindromes may
lead to a deeper insight into which cardinalities can be reached, but due to space
restrictions we
will only mention one conjecture here, which may already show that cardinalities
are somehow
propagating for θ-palindromes. Notice that this conjecture implies that,
similar to the second gap,
4k−4 is always
reached, but that in contrast to the second gap, the third gap is not of the
form
4k−4−i for
i∈[k−4].
Conjecture 22
The k-spectrum of w=abk−1ak−1b has 4(k−1) elements and
moreover if w′=wR with a k-spectrum of cardinality ℓ∈N≥12
then the scattered factor set of awb has cardinality 241ℓ−5.
5 Reconstructing Weakly-0-Balanced Words from their k-Spectra
In the final section we consider the slightly different problem of reconstructing a word from its scattered factors, or more specifically in this case, k-spectra. More generally, we are interested in how much information about a (weakly-0-balanced) word w is
contained in its scattered factors, and more precisely, which scattered factors
are not necessary or useful for reconstructing the word w, or distinguishing it from others. Since w is a scattered factor of itself, it is trivial that the
scattered factor of length ∣w∣ is sufficient to uniquely reconstruct w. On
the other hand, all words over {a,b}∗ containing both letters will have the
same 1-spectrum. Thus we see that the length of the scattered factors of a
word w plays a role in how much information about w they contain. This
relationship is described more precisely by the following result of Dress and
Erdös [3] along with the fact that (cf. e.g.
Proposition 5) a word of length 2k is not uniquely determined by
its scattered factors of length k.
Proposition 23 (Dress and Erdös [3])
If ScatFactk+1(w)=ScatFactk+1(w′) holds for w,w′∈Σ≤2k
then w=w′ follows.
Proof (for w, w′ being
weakly-0-balanced)
We give a procedure for uniquely reconstructing w from ScatFactk+1(w). For
all
i,j∈N0 such that i+j=k, ask whether a^i b a^j∈ScatFactk+1(w). Since
there are exactly i+j occurrences of a in w, all are accounted for in
the (potential)
scattered factor a^i b a^j, and thus the answer is ‘yes’ if and only if
there are one or more
bs between the ith and (i+1)th occurrences of a in w. Hence after
these queries, we know
exactly which as are consecutive (i.e. do not have a b between them) in
w. Similarly, for all i,j∈N0 such that i+j=k, we ask whether
b^i a b^j∈ScatFactk+1(w). By symmetry, this tells us exactly which bs are
consecutive. This is sufficient
information to specify w completely.∎
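The query procedure from this proof can be sketched as follows. This is an illustrative implementation under the stated assumption that w has exactly k occurrences of each letter and that its (k+1)-spectrum is given as a Python set; the function names `block_sizes` and `reconstruct` are hypothetical.

```python
from itertools import combinations

def scat_fact(w, m):
    """m-spectrum of w: the set of distinct scattered factors of length m."""
    return {"".join(c) for c in combinations(w, m)}

def block_sizes(spectrum, x, y, k):
    """Sizes of the maximal blocks of letter x, in order of appearance.

    The query x^i y x^(k-i) succeeds exactly when some y lies between the
    i-th and (i+1)-th occurrence of x, i.e. when a block of x ends there.
    """
    sizes, run = [], 0
    for i in range(1, k + 1):
        run += 1
        if i == k or x * i + y + x * (k - i) in spectrum:
            sizes.append(run)
            run = 0
    return sizes

def reconstruct(spectrum, k):
    """Rebuild w (k occurrences each of a and b) from its (k+1)-spectrum."""
    a_blocks = block_sizes(spectrum, "a", "b", k)
    b_blocks = block_sizes(spectrum, "b", "a", k)
    # The query for i = 0 tells whether some b precedes the first a.
    if "b" + "a" * k in spectrum:
        order = (("b", b_blocks), ("a", a_blocks))
    else:
        order = (("a", a_blocks), ("b", b_blocks))
    pieces = []
    for idx in range(max(len(a_blocks), len(b_blocks))):
        for letter, blocks in order:
            if idx < len(blocks):
                pieces.append(letter * blocks[idx])
    return "".join(pieces)

if __name__ == "__main__":
    w = "aabbab"
    print(reconstruct(scat_fact(w, 4), 3) == w)
```

Only 2(k+1) membership queries of the forms a^i b a^j and b^i a b^j are used, which mirrors the observation that highly unbalanced scattered factors carry the information needed here.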
In the proof of Proposition 23,
a pivotal role is played by scattered factors which
contain many as and a few bs or vice-versa. The
question arises as to whether this is due to the fact that these
scattered factors contain inherently more information about the structure of the
whole word than e.g., weakly-0-balanced ones. In the general case, the answer is,
sometimes at least, yes: we cannot distinguish between e.g. two words in
{a}∗ by their weakly-0-balanced scattered factors, as the only such factor is ε. The same problem arises for all words which have a sufficiently uneven ratio of as to bs.
However, if in addition we consider only weakly-0-balanced words, then the
situation changes. We conjecture that in fact, for these words w,
the weakly-0-balanced scattered factors are just as informative about w as
the unbalanced ones. More formally, we believe the following adaptation of
Proposition 23 holds:
Conjecture 24
Let k∈N. Let k′=k+1 for odd k, and k′=k+2
for even k. Let
w,w′∈Σwzb2k such that ScatFactk′(w)∩Σwzbk′=ScatFactk′(w′)∩Σwzbk′. Then w=w′.
While we do not resolve the conjecture, we give an example of a subclass of
words for which it holds true, namely when there are at most two blocks of
bs (and therefore by symmetry if there are at most two blocks of
as).
Proposition 25
Let k∈N. If k is odd, then each word w∈a∗b∗a∗b∗a∗∩Σwzb2k is uniquely determined by the set ScatFactk+1(w)∩Σwzbk+1.
Similarly, if k is even, then each word w∈a∗b∗a∗b∗a∗∩Σwzb2k
is uniquely determined by the set ScatFactk+2(w)∩Σwzbk+2.
Proof
As in the proof of Proposition 23, we give an algorithm for
uniquely reconstructing w.
W.l.o.g., let k be odd. The case that k is even is easily adapted. Let w=aibjaℓbk−jak−i−ℓ and let S=ScatFactk+1(w)∩Σwzb∗.
Firstly, we shall deal with the case that ℓ=0. Note that we can decide
whether ℓ=0 by querying whether there exists a scattered factor u∈S
such that u∈a∗b+a+b+a∗. Now, if ℓ=0, we have w=a^ib^ka^{k−i}. Since k is odd, exactly one of i,k−i will be at most
(k−1)/2. We can decide which one by querying whether
a^{(k+1)/2}b^{(k+1)/2}∈S. W.l.o.g., suppose i≤(k−1)/2 (so the query returns “no”). The other case is symmetric.
Then note that a^ib^{(k+1)/2}a^{(k+1)/2−i}∈S but
a^{i+1}b^{(k+1)/2}a^{(k+1)/2−i−1}∉S. Thus the exact
value of i (and therefore k−i) can be inferred directly from observing
scattered factors of this form in S.
Now consider the case that ℓ≠0. Note that there exists u∈b+a^{(k+1)/2}b+∩S if and only if ℓ≥(k+1)/2. Suppose firstly that ℓ≥(k+1)/2. Then i+(k−i−ℓ)≤(k−1)/2. Thus we can determine i and (k−i−ℓ) (and
therefore ℓ) by looking for the maximum m1,m2 such that there exists u∈a^{m1}b+a+b+a^{m2} with u∈S (i is the maximum m1
while k−i−ℓ is the maximum m2). Moreover, exactly one of j,k−j will be
less than (k+1)/2. We can decide which one by querying whether
a^{(k+1)/2}b^{(k+1)/2}∈S. If so, it must be that k−j≥(k+1)/2. Suppose that this is the case (the other case is
symmetric). Then as before, we can determine the exact value of j by looking
at the scattered factors of the form b^ma+b+a∗ (i.e., j is the
maximum m) and we are done.
Finally, we consider the case that 0<ℓ<(k+1)/2. Then ℓ can
be uniquely determined as the maximum m such that there exists u∈a∗b+a^mb+a∗ with u∈S. In order to determine i (or
equivalently k−i−ℓ), we look for the maximum m1,m2 such that there
exist u1∈a^{m1}b+a+b+a∗ and u2∈a∗b+a+b+a^{m2} with u1,u2∈S. In particular at least one of
m1,m2 must be strictly less than (k−1)/2. If m1<(k−1)/2,
then i=m1 and if m2<(k−1)/2 then k−ℓ−i=m2. In either
case, since ℓ is already known, this uniquely determines both i and
k−i−ℓ.
It remains to determine j (or equivalently k−j). Recall that exactly one of
j,k−j will be less than (k+1)/2. Let m1 be the maximum m such
that there exists u∈a∗b^ma+b+a∗ with u∈S and let m2
be the maximum m such that there exists u∈a∗b+a+b^ma∗ with
u∈S. Note that m1,m2≤(k−1)/2. If m1<(k−1)/2
(resp.
m2<(k−1)/2), then j=m1 (resp. k−j=m2), and thus j and
k−j
can be inferred. If m1=m2=(k−1)/2, then either j=(k−1)/2
or k−j=(k−1)/2. Now, if k−i−ℓ<(k+1)/2, there exists u∈a∗b^{(k+1)/2}a+a^{k−i−ℓ} with u∈S if and only if
j=(k+1)/2 (in which case k−j=(k−1)/2). On the other hand, if
k−i−ℓ≥(k+1)/2, then i<(k+1)/2 and there exists u∈a^ia+b^{(k+1)/2}a∗ with u∈S if and only if k−j=(k+1)/2 (in which case j=(k−1)/2). In either case, all
exponents
are known and we have uniquely reconstructed w.∎
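For small k, the uniqueness statement of Proposition 25 can be tested directly by checking that no two words of the form a∗b∗a∗b∗a∗ share the same set of balanced scattered factors. The following sketch is an illustration only; it does not implement the query strategy of the proof, and the names `balanced_spectrum`, `two_b_block_words` and `distinguishes` are hypothetical.

```python
from itertools import combinations

def balanced_spectrum(w, m):
    """Length-m scattered factors of w with equally many a and b."""
    return frozenset(
        u for u in {"".join(c) for c in combinations(w, m)}
        if u.count("a") == u.count("b")
    )

def two_b_block_words(k):
    """All words a^i b^j a^l b^(k-j) a^(k-i-l) with non-negative exponents."""
    words = set()
    for i in range(k + 1):
        for l in range(k - i + 1):
            for j in range(k + 1):
                words.add("a" * i + "b" * j + "a" * l + "b" * (k - j) + "a" * (k - i - l))
    return words

def distinguishes(k):
    """True if the balanced spectra of length k+1 (k odd) or k+2 (k even) are pairwise distinct."""
    m = k + 1 if k % 2 == 1 else k + 2
    words = two_b_block_words(k)
    return len({balanced_spectrum(w, m) for w in words}) == len(words)

if __name__ == "__main__":
    print(len(two_b_block_words(3)), distinguishes(3))
```

For k=3 there are 16 such words, and their sets of balanced scattered factors of length 4 are pairwise different.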
The difficulty in proving Conjecture 24 seems to arise from the
fact that, for different pairs of words w,w′∈Σwzb, the set of
scattered factors which distinguish them, namely the symmetric difference of
ScatFactk(w)∩Σwzbk and ScatFactk(w′)∩Σwzbk
(for appropriate k), varies considerably, unlike in the proof(s) of
Proposition 23, where the set of distinguishing scattered factors
is always made up of words of the same form, regardless of the choice of w and
w′. As an example, consider
the words w=ababab, w′=bababa, and
w′′=ababba. Then the symmetric difference of ScatFact4(w)∩Σwzb4 and
ScatFact4(w′)∩Σwzb4 is {aabb,bbaa}. On the other
hand, considering ScatFact4(w)∩Σwzb4 and
ScatFact4(w′′)∩Σwzb4, the symmetric
difference is
{baab}.
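The sets in this example can be recomputed directly. The sketch below is illustrative; `balanced_part` is a hypothetical helper that intersects the 4-spectrum with the words having equally many a and b, and Python's `^` operator on sets gives the symmetric difference.

```python
from itertools import combinations

def balanced_part(w, m):
    """Length-m scattered factors of w with equally many a and b."""
    return {
        u for u in {"".join(c) for c in combinations(w, m)}
        if u.count("a") == u.count("b")
    }

if __name__ == "__main__":
    words = {"w": "ababab", "w'": "bababa", "w''": "ababba"}
    names = list(words)
    for x in range(len(names)):
        for y in range(x + 1, len(names)):
            s = balanced_part(words[names[x]], 4)
            t = balanced_part(words[names[y]], 4)
            # Symmetric difference: factors that tell the two words apart.
            print(names[x], names[y], sorted(s ^ t))
```

The three pairwise differences have different sizes and shapes, which illustrates why no single family of distinguishing factors works for all pairs.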
6 Conclusions
We have considered properties of k-spectra of weakly-0-balanced words. In
particular, in Section 4
we give several insights into the structure of the set of all k-spectra of
weakly-0-balanced words
of length 2k by considering for which numbers n there exists w such that
the k-spectrum of
w has cardinality n. In particular, we characterise the first two gaps in
the possibilities for
each k, which are regular in the sense that the first and second gaps are
always from k+2 to
2k−1 and from 2k+1 to 3k−4 (inclusive). On the other hand, we see that the
third gap is
considerably
less regular and thus resists a natural characterisation.
In Section 5, we consider the task of reconstructing
weakly-0-balanced words
from their k-spectra. We note that this is, in a sense, as hard as in the
general case; however, we also conjecture that even if we consider only the
scattered factors which are also weakly-0-balanced, then the situation remains
the same, in the sense that it can be achieved for the same choices of k.
Resolving this conjecture appears to require some new approach, however, since the
techniques for the general case are not easily adapted.
As mentioned at the end of Section 4, some of the weakly-0-balanced words are
θ-palindromes.
Since the θ-palindromes of length 2k are constructible from the ones of
length 2(k−1)
(except for each even k exactly one θ-palindrome) we surmised that the
structure and
properties propagate. Moreover we expected that the knowledge of the word’s
second half helps in
finding the cardinalities of the k-spectra. Nevertheless we were only able to
get results for
θ-palindromes in the same manner as for the other words, but we still
believe that the
structure of the θ-palindromes can reveal more insights with further
work.