Minimal Absent Words in Rooted and Unrooted Trees
Gabriele Fici, Paweł Gawrychowski

TL;DR
This paper extends the concept of minimal absent words to rooted and unrooted trees with labeled edges, providing bounds on their number and algorithms for efficient computation.
Contribution
It introduces the theory of minimal absent words for trees, establishes bounds on their size, and develops output-sensitive algorithms for their computation.
Findings
Bounds of O(nσ) for rooted trees and O(n^2σ) for unrooted trees on the size of MAW sets.
Algorithms to compute all minimal absent words in output-sensitive time.
Bounds are tight and algorithms are efficient for large trees.
Abstract
We extend the theory of minimal absent words to (rooted and unrooted) trees, having edges labeled by letters from an alphabet $\Sigma$ of cardinality $\sigma$. We show that the set $\mathrm{MAW}(\textsf{T})$ of minimal absent words of a rooted (resp. unrooted) tree T with $n$ nodes has cardinality $O(n\sigma)$ (resp. $O(n^{2}\sigma)$), and we show that these bounds are realized. Then, we exhibit algorithms to compute all minimal absent words in a rooted (resp. unrooted) tree in output-sensitive time $O(n+|\mathrm{MAW}(\textsf{T})|)$ (resp. $O(n^{2}+|\mathrm{MAW}(\textsf{T})|)$), assuming an integer alphabet of size polynomial in $n$.
Gabriele Fici: Dipartimento di Matematica e Informatica, Università di Palermo, Italy
Paweł Gawrychowski: Institute of Computer Science, University of Wrocław, Poland
1 Introduction
Minimal absent words (a.k.a. minimal forbidden words or minimal forbidden factors) are a useful combinatorial tool for investigating words (strings). A word $u$ is absent from a word $w$ if $u$ does not occur (as a factor) in $w$, and it is minimal if all its proper factors occur in $w$. This definition naturally extends to languages of words closed under taking factors.
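As a concrete illustration of the definition for ordinary words, the following brute-force sketch enumerates minimal absent words by testing every candidate directly. It is meant only for tiny inputs and is unrelated to the efficient algorithms cited below; the function names `factors` and `minimal_absent_words` are hypothetical.

```python
def factors(w):
    """Return the set of all factors (substrings) of w, including the empty word."""
    return {w[i:j] for i in range(len(w) + 1) for j in range(i, len(w) + 1)}


def minimal_absent_words(w, alphabet):
    """Brute-force minimal absent words of w over the given alphabet."""
    fac = factors(w)
    maws = set()
    # Length-1 case: a letter that never occurs in w
    # (its only proper factor is the empty word, which always occurs).
    for a in alphabet:
        if a not in fac:
            maws.add(a)
    # General case a+u+b: the word is absent, while a+u and u+b both occur.
    for u in fac:
        for a in alphabet:
            if a + u not in fac:
                continue
            for b in alphabet:
                if u + b in fac and a + u + b not in fac:
                    maws.add(a + u + b)
    return maws


print(sorted(minimal_absent_words("abaab", "ab")))
```

Checking that $au$ and $ub$ occur is enough, because every proper factor of $aub$ is a factor of one of these two words.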
The theory of minimal absent words has been developed in a series of papers [5, 14, 25, 27, 3] (the reader is pointed to [18] for a survey on these results). Minimal absent words have then found applications in several areas, e.g., data compression [15, 16, 17, 28], on-line pattern matching [13], sequence comparison [10, 11], sequence assembly [20, 26], bioinformatics [9, 31, 19], musical data extraction [12].
Bounds on the number of minimal absent words have been extensively investigated. The upper bound on the number of minimal absent words of a word of length $n$ over an alphabet of size $\sigma$ is $O(n\sigma)$ [14, 27], and this is tight for integer alphabets [10]; in fact, for large alphabets, such as when , this bound is also tight even for minimal absent words having the same length [1].
Several algorithms are known to compute the set of minimal absent words of a word. State-of-the-art algorithms compute all minimal absent words of a word $w$ of length $n$ over an alphabet of size $\sigma$ in time $O(n\sigma)$ [14, 2] or in output-sensitive time $O(n+|\mathrm{MAW}(w)|)$ [22, 11] for integer alphabets. Space-efficient data structures based on the Burrows-Wheeler transform can also be applied for this computation [7, 6].
For a finite set $S$ of words over an alphabet of size $\sigma$, the minimal absent words of the factorial closure of $S$ can be computed in $O(n\sigma)$ time [3], where $n$ is the sum of the lengths of the words of $S$. Generalizations of minimal absent words have been considered for circular words [10, 21] and multi-dimensional shifts [4].
In this paper, we extend the theory of minimal absent words to trees. We consider trees with edges labeled by letters from an integer alphabet $\Sigma$ of cardinality $\sigma$ polynomial in the number $n$ of nodes. In the case of a rooted tree T, every node $v$ is associated with a word $\textsf{str}(v)$, defined as the sequence of edge labels from $v$ to the root. A rooted tree T can therefore be seen as a set of words $L_{\textsf{T}}=\{\textsf{str}(v)\mid v \text{ in } \textsf{T}\}$, which we call the language of T. If T has $n$ nodes, then $L_{\textsf{T}}$ contains at most $n$ distinct words, each of which has length at most $n-1$. We call a rooted tree T deterministic when the edges from a node to its children are labeled by pairwise distinct letters. Throughout the paper we will assume that all rooted trees are deterministic, which can be ensured without loss of generality thanks to the following lemma.
Lemma 1
Given a rooted tree T on $n$ nodes we can construct in $O(n)$ time a deterministic rooted tree $\textsf{T}'$ with the same set of corresponding words.
Proof.
The depth of a node of T is its distance from the root. We start by sorting, for every depth $d$, the set $D_d$ of nodes at depth $d$ according to the labels of the edges leading to their parents. This can be done in $O(n)$ total time with counting sort. Then, we construct $\textsf{T}'$ by processing $d=0,1,2,\ldots$ in increasing order. Assuming that we have already identified, for every node $u\in D_d$, its corresponding node $f(u)$ of $\textsf{T}'$, we need to construct and identify the nodes $f(v)$ for every $v\in D_{d+1}$. We process all nodes $v\in D_{d+1}$ in groups corresponding to the same letter $a$ on the edge leading to their parent (because of the initial sorting we already have these groups available). Denoting by $u$ the parent of $v$ in T, we check if $f(u)$ has been already accessed while processing the group of $v$, and if so we set $f(v)$ to be the already created node of $\textsf{T}'$. Otherwise, we create a new edge outgoing from $f(u)$ to a new node $v'$ in $\textsf{T}'$ and labeled with $a$, and set $f(v)$ to be $v'$. To check if $f(u)$ has been already accessed while processing the current group (and retrieve the corresponding $f(v)$ if this is the case) we simply allocate an array $M$ of size $n$ indexed by nodes of $\textsf{T}'$ identified by numbers from $\{1,\ldots,n\}$. For every entry of $M$ we additionally store a timestamp denoting the most recent group for which the corresponding entry has been modified, and increase the timestamp after having processed the current group. ∎
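The merging step of Lemma 1 can be sketched as follows. This is a simplified illustration, not the procedure from the proof: it uses a hash map keyed by (image of the parent, letter) in place of counting sort and the timestamped array, so it runs in expected rather than worst-case linear time. The parent-array representation and the name `determinize` are assumptions made for the example.

```python
from collections import deque


def determinize(parent, label, root=0):
    """Merge nodes so that the children of every node carry pairwise distinct labels.

    parent[v], label[v]: parent and edge label of node v (ignored for the root).
    Returns (new_parent, new_label, f), where f[v] is the image of v in the new tree.
    """
    n = len(parent)
    children = [[] for _ in range(n)]
    for v in range(n):
        if v != root:
            children[parent[v]].append(v)

    new_parent, new_label = [-1], [None]  # node 0 of the new tree is its root
    f = [None] * n
    f[root] = 0
    child_of = {}  # (node of the new tree, letter) -> child of that node
    queue = deque([root])
    while queue:  # breadth-first order, i.e. depth by depth
        u = queue.popleft()
        for v in children[u]:
            key = (f[u], label[v])
            if key not in child_of:  # first node reaching f(u) with this letter
                child_of[key] = len(new_parent)
                new_parent.append(f[u])
                new_label.append(label[v])
            f[v] = child_of[key]
            queue.append(v)
    return new_parent, new_label, f


# Two children of the root share the letter 'a' and are merged.
parent = [-1, 0, 0, 0, 1, 2, 2]
label = [None, "a", "a", "b", "c", "c", "d"]
print(determinize(parent, label))
```

Nodes with the same word $\textsf{str}(v)$ are mapped to the same node of the new tree, so the language is unchanged.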
One could also define the set of words corresponding to a rooted tree T by considering a set of words from the root to every node (in the literature this is sometimes called a forward trie, as opposed to a backward trie, cf. [24]). In our context, this distinction is meaningless, as the obtained languages are the same up to reversing all the words.
We say that a word $aub$, with $a,b\in\Sigma$, is a minimal absent word of a rooted tree T if $aub$ is not a factor of any word in $L_{\textsf{T}}$ but there exist words $w_1$ and $w_2$ in $L_{\textsf{T}}$ (not necessarily distinct) such that $au$ is a factor of $w_1$ and $ub$ is a factor of $w_2$. That is, the set of minimal absent words of T is the set of minimal absent words of the factorial closure of the language $L_{\textsf{T}}$. Since any word of length $n$ can be transformed into a unary rooted tree with $n+1$ nodes, some of the properties of minimal absent words for usual words can be transferred to rooted trees. Indeed, rooted trees are a strict generalization of words.
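To make the definition for rooted trees concrete, here is a brute-force sketch that builds the language of a small tree and tests every candidate word of the form $aub$ directly. The parent-array representation and the function names are assumptions made for the example; the running time is far from the bounds discussed in this paper.

```python
def tree_language(parent, label, root=0):
    """Return {str(v)}: the edge labels read from every node v up to the root."""
    words = set()
    for v in range(len(parent)):
        w, x = [], v
        while x != root:
            w.append(label[x])
            x = parent[x]
        words.add("".join(w))
    return words


def rooted_tree_maws(parent, label, alphabet, root=0):
    """Brute-force minimal absent words a+u+b of the factorial closure of the language."""
    fac = {w[i:j]
           for w in tree_language(parent, label, root)
           for i in range(len(w) + 1)
           for j in range(i, len(w) + 1)}
    return {a + u + b
            for u in fac for a in alphabet for b in alphabet
            if a + u in fac and u + b in fac and a + u + b not in fac}


# Unary tree whose deepest node spells "abaab" towards the root.
parent = [-1, 0, 1, 2, 3, 4]
label = [None, "b", "a", "a", "b", "a"]
print(sorted(rooted_tree_maws(parent, label, "ab")))
```

On a unary tree the result coincides with the minimal absent words of the corresponding word, which matches the remark that rooted trees generalize words.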
For unrooted trees, the definition of minimal absent words is analogous: We identify an unrooted tree T with the language $L_{\textsf{T}}$ of words corresponding to all (concatenations of labels of) simple paths that can be read in T from any of its nodes. The language $L_{\textsf{T}}$ contains $O(n^{2})$ words, each of which has length at most $n-1$. We therefore define the set of minimal absent words of T as the set of minimal absent words of the language $L_{\textsf{T}}$, which in this case is already closed under taking factors by definition.
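The language of an unrooted tree can be enumerated naively by walking away from every node. The sketch below does this and then applies the definition of minimal absent words directly; the edge-list representation and the function names are assumptions made for the example.

```python
def path_words(n, edges):
    """Words spelled by all simple paths of an unrooted tree (empty path included).

    edges: list of (u, v, letter) triples over nodes 0..n-1.
    """
    adj = [[] for _ in range(n)]
    for u, v, c in edges:
        adj[u].append((v, c))
        adj[v].append((u, c))
    words = set()
    for s in range(n):
        stack = [(s, -1, "")]  # (current node, previous node, word read so far)
        while stack:
            x, prev, w = stack.pop()
            words.add(w)
            for y, c in adj[x]:
                if y != prev:  # in a tree, never stepping back keeps the path simple
                    stack.append((y, x, w + c))
    return words


def unrooted_tree_maws(n, edges, alphabet):
    """Brute-force minimal absent words; the language is already closed under factors."""
    words = path_words(n, edges)
    return {a + u + b
            for u in words for a in alphabet for b in alphabet
            if a + u in words and u + b in words and a + u + b not in words}


edges = [(0, 1, "a"), (1, 2, "b")]
print(sorted(path_words(3, edges)))
print(sorted(unrooted_tree_maws(3, edges, "ab")))
```

Each path is read in both directions, so a path with three nodes labeled a, b already contributes both "ab" and "ba".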
Our results.
We prove that for any rooted tree with $n$ nodes there are $O(n\sigma)$ minimal absent words, and we show that this bound is tight. For unrooted trees, we prove that the previous bound becomes $O(n^{2}\sigma)$, and we give an explicit construction that achieves this bound. We also consider the case of minimal absent words of fixed length and generalize a previously-known construction.
Furthermore, we present an algorithm that computes all the minimal absent words in a rooted tree T with $n$ nodes in output-sensitive time $O(n+|\mathrm{MAW}(\textsf{T})|)$. This also yields an algorithm that computes all the minimal absent words in an unrooted tree T with $n$ nodes in time $O(n^{2}+|\mathrm{MAW}(\textsf{T})|)$. Note that while it is plausible that an efficient algorithm could have been designed, as in the case of words, from a DAWG [22], the size of the DAWG of a backward/forward tree is superlinear [24], so it is not immediately clear if such an approach would lead to an optimal algorithm. Excluding the space necessary to store all the results, our algorithms need $O(n)$ and $O(n^{2})$ space, respectively.
Our algorithms are designed in the word-RAM model with $\Omega(\log n)$-bit words.
2 Bounds on the number of minimal absent words
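Let T be a rooted tree with $n$ nodes and edges labeled by letters from an integer alphabet $\Sigma$ of cardinality $\sigma$ polynomial in $n$. Let the language of T be $L_{\textsf{T}}=\{\textsf{str}(v)\mid v \text{ in } \textsf{T}\}$, where $\textsf{str}(v)$ is the sequence of edge labels from node $v$ to the root.
For convenience, we add a new root to T and an edge, from the new root to the old one, labeled by a new letter \$ not belonging to $\Sigma$, so that every word of $L_{\textsf{T}}$ ends with \$. We denote by ST the suffix tree of T, that is, the compacted trie of the words in $L_{\textsf{T}}$. Every (implicit or explicit) node $u$ of ST is identified with the word $u$ spelled from the root of ST, which is a prefix of a word in $L_{\textsf{T}}$. Since \$ does not belong to the original alphabet, the leaves of ST are in one-to-one correspondence with the nodes of T.
A word $aub$, with $a,b\in\Sigma$, is a minimal absent word for T if it is a minimal absent word for the factorial closure of $L_{\textsf{T}}$, that is, if both $au$ and $ub$ but not $aub$ are factors of some words in $L_{\textsf{T}}$. The set of minimal absent words of T is denoted by $\mathrm{MAW}(\textsf{T})$.
If $aub\in\mathrm{MAW}(\textsf{T})$, then $au$ occurs as a factor in some word of $L_{\textsf{T}}$ but never followed by letter $b$, hence there exists a letter $b'\in\Sigma\cup\{\$\}$ different from $b$ such that $aub'$, and therefore $ub'$, is a factor of some word in $L_{\textsf{T}}$. Since $ub$ is a factor as well, $u$ is an explicit node of ST with an outgoing edge whose label starts with $b$. Each minimal absent word $aub$ is thus determined by the letter $a$ and this edge, so the number of minimal absent words of T is at most the product of $\sigma$ and the number of edges of ST.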
Theorem 2.1
The number of minimal absent words of a rooted tree with $n$ nodes whose edges are labeled by letters from an alphabet of size $\sigma$ is $O(n\sigma)$.
Therefore, the same upper bound that holds for words also holds for rooted trees. As a consequence, we have that all known upper bounds for words, and constructions that realize them, are still valid for rooted trees.
In particular, one question that has been studied is whether the upper bound is still tight when one considers minimal absent words of a fixed length. Almirantis et al. [1, Lemma 2] showed that the upper bound for a fixed length of minimal absent words is tight if . Actually, they showed that it is possible to construct words of any length , with , having minimal absent words of length . We now give a construction that generalizes this result.
Let . For every , let be such that . Let . For every we define the word
[TABLE]
where \$$ is a new symbol not belonging to \Sigmaw_{i}2\sigma(k+2)+1nk=2|w_{i}|\leq\sigma^{k}\sigma\geq 9k>2|w_{i}|\leq\sigma^{k}\sigma+k\geq 7$..
Let and set , so that . We have that is a minimal absent word of for every and . So, has length and there are minimal absent words of of length .
Thus, we have proved the following theorem.
Theorem 2.2
A word of length over an alphabet of size can have minimal absent words all of the same length.
Observe that for , , therefore Theorem 2.2 strictly generalizes Almirantis et al.’s result.
Let now T be an unrooted tree. The number of distinct simple paths in T is $O(n^{2})$. Since each minimal absent word $aub$ is uniquely described by a pair $(au,b)$ such that $au$ is read along a simple path in T and $b$ is a letter, we have that the number of minimal absent words of T is upper-bounded by $O(n^{2}\sigma)$.
Theorem 2.3
The number of minimal absent words of an unrooted tree with $n$ nodes whose edges are labeled by letters from an alphabet of size $\sigma$ is $O(n^{2}\sigma)$.
We now provide an example of an unrooted tree realizing this bound. Let Our unrooted tree T is built as follows:
- We first build a sequence of nodes such that every other node is connected to terminal nodes with edges labeled by and is connected to the next node of the sequence with an edge labeled by [math];
- Then, we attach to each of the last nodes of the previous sequence simple paths composed of nodes with edges labeled by [math].
See Figure 1 for an illustration.
In total, T has nodes. We therefore set , so that .
It is readily verified that for every in and for every , there is a minimal absent word of the form (the prefix can be found reading from the left part to the right part of the figure, while the suffix can be found reading from the right part to the left part, the letter being one of the labels of the edges joining the left and the right part). Hence, the number of minimal absent words of T is .
Remark 1
The previous construction can be modified in such a way that edges adjacent to a node have distinct labels, keeping the same bound on the number of minimal absent words.
3 Algorithms for computing minimal absent words
We now present an algorithm that computes the set $\mathrm{MAW}(\textsf{T})$ of all minimal absent words of a rooted tree T with $n$ nodes in output-sensitive time $O(n+|\mathrm{MAW}(\textsf{T})|)$.
We construct the suffix tree ST of T in $O(n)$ time [30]. Recall that the leaves of ST are in one-to-one correspondence with the nodes of T and we can assume that every node $v$ of T stores a pointer to the leaf of ST corresponding to $v$.
Definition 1
For every (implicit or explicit) node $u$ of ST, we define the set $A(u)$ as the set of all letters $a\in\Sigma$ such that $au$ can be spelled from the root of ST, i.e., there exists a node $v$ of T such that $\textsf{str}(v)=auw$ for some (possibly empty) word $w$.
As already noted before, if $aub$ is a minimal absent word of T, then $au$ occurs as a factor in some word of $L_{\textsf{T}}$ followed by a letter $b'\in\Sigma\cup\{\$\}$ different from $b$, hence $u$ is an explicit node of ST.
Lemma 2
Let $u$ be an explicit node of the suffix tree ST of the tree T. Let $u_1,\ldots,u_k$ be the children of $u$ in the non-compacted trie from which we obtained ST, and let $b_1,\ldots,b_k$ be the labels of the corresponding edges. Then, for every $i$ and every letter
\[ a\in\bigl(A(u_1)\cup\cdots\cup A(u_k)\bigr)\setminus A(u_i), \]
the word $aub_i$ is a minimal absent word of T.
Conversely, every minimal absent word of T is of the form described above.
Proof.
Since $a$ does not belong to $A(u_i)$, then by definition the word $aub_i$ is not a factor of any word in $L_{\textsf{T}}$, but there exists $j\neq i$ such that $a\in A(u_j)$, that is, $aub_j$ is a factor of a word in $L_{\textsf{T}}$. Hence, $au$ is a factor of a word in $L_{\textsf{T}}$. Since $ub_i$ is also a factor of a word in $L_{\textsf{T}}$ by construction, we have that $aub_i$ is a minimal absent word of T.
Conversely, if $aub$ is a minimal absent word of T, then $u$ occurs as a factor in some word of $L_{\textsf{T}}$ followed by different letters in $\Sigma\cup\{\$\}$, hence it corresponds to an explicit node in ST, so all minimal absent words of T are found in this way. ∎
Definition 2
For every leaf $u$ of ST we define the set $B(u)$ as the set of all letters $a\in\Sigma$ such that $au=\textsf{str}(v)$ for some node $v$ in T.
Lemma 3
For every (implicit or explicit) node $u$ of ST, we have $A(u)=\bigcup\{B(u')\mid u' \text{ is a leaf in the subtree of } \textit{ST} \text{ rooted at } u\}$.
Proof.
Let $u'$ be a leaf in the subtree of ST rooted at $u$. Thus, the word $u$ is a prefix of the word $u'$, i.e., $u'=uw$ for some word $w$. By definition, $B(u')$ is the set of all letters $a$ such that $au'=\textsf{str}(v)$ for some node $v$ in T. That is, the set of all letters $a$ such that $auw$ is a word in $L_{\textsf{T}}$. On the other hand, by definition, $A(u)$ is the set of all letters $a$ such that $\textsf{str}(v)=auw$ for some node $v$ of T and word $w$. That is, the set of all letters $a$ such that $auw$ is a word in $L_{\textsf{T}}$ for some word $w$. ∎
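Lemmas 2 and 3 can be checked on small inputs with the following sketch. It is an illustration under simplifying assumptions, not the algorithm of this section: it works on the non-compacted trie, represented implicitly as a dictionary from prefixes to letter sets, so its size may be quadratic. The sets are accumulated leaf by leaf as in Lemma 3, and the rule of Lemma 2 is then applied to every trie edge whose letter is not the sentinel. The name `maws_via_lemma2` and the parent-array input are hypothetical.

```python
def maws_via_lemma2(parent, label, root=0):
    """Minimal absent words of a deterministic rooted tree via the sets A(u)."""
    n = len(parent)
    # Letters on the edges from each node to its children; for the leaf of the
    # trie spelling str(v) + "$" this is exactly the set B of Definition 2.
    child_letters = [set() for _ in range(n)]
    for v in range(n):
        if v != root:
            child_letters[parent[v]].add(label[v])

    A = {}  # trie node (a prefix of some str(v) + "$") -> set A of that node
    for v in range(n):
        w, x = [], v
        while x != root:
            w.append(label[x])
            x = parent[x]
        leaf = "".join(w) + "$"
        # Lemma 3: the set of a node is the union of B over the leaves below it,
        # i.e. over the leaves having this node as a prefix.
        for k in range(len(leaf) + 1):
            A.setdefault(leaf[:k], set()).update(child_letters[v])

    maws = set()
    for p in A:
        if p and p[-1] != "$":       # trie edge from u = p[:-1] with letter b = p[-1]
            u, b = p[:-1], p[-1]
            for a in A[u] - A[p]:    # a+u occurs, but never followed by b
                maws.add(a + u + b)
    return maws


# Unary tree whose deepest node spells "abaab" towards the root.
print(sorted(maws_via_lemma2([-1, 0, 1, 2, 3, 4], [None, "b", "a", "a", "b", "a"])))
```

Edges labeled by the sentinel are skipped because the sentinel is not a letter of the original alphabet.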
We now show how to compute, in time proportional to the output size, the set $\mathrm{MAW}(\textsf{T})$.
We start with creating, for every letter $a$, a list $L_a$ of all leaves $u$ such that $a\in B(u)$ sorted in preorder. The lists can be obtained in linear time by traversing all the non-root nodes $v$ of T, following the edge labeled by $a$ from $v$ to its parent $v'$, and finally following the pointer from $v'$ to the leaf $u$ of ST corresponding to $v'$ and adding $u$ to $L_a$. Finally, because the preorder numbers are integers from a range of size $O(n)$, the lists can be sorted in linear time with counting sort.
Now we iterate over all letters $a$. Due to Lemma 2, the goal is to extract all explicit nodes $u$ such that, for some child $u'$ of $u$ such that $b$ is the first letter on the edge from $u$ to $u'$, $aub$ is a minimal absent word. By Lemma 3, this is equivalent to $u$ having a descendant leaf in $L_a$ and $u'$ not having any such descendant. This suggests that we should work with the subtree of ST, denoted $\textit{ST}_a$, induced by all leaves $u\in L_a$. Formally, a node $v$ of ST belongs to $\textit{ST}_a$ when $u\in L_a$ for some leaf $u$ in the subtree of $v$. Even though ST does not contain nodes with just one child, this is no longer the case for $\textit{ST}_a$. Thus, we actually work with its compact version, denoted $\textit{ST}_a^{c}$. Every node of $\textit{ST}_a^{c}$ stores a pointer to its corresponding node of ST. Assuming that ST has been preprocessed for constant-time Lowest Common Ancestor queries (which can be done in linear time and space [29, 8]), we can construct $\textit{ST}_a^{c}$ efficiently due to the following lemma.
Lemma 4
Given $L_a$, we can construct $\textit{ST}_a^{c}$ in $O(|L_a|)$ time.
Proof.
The procedure follows the general idea used in the well-known linear-time procedure for creating a Cartesian tree [23]. We process the nodes $u\in L_a$ in preorder and maintain a compact version of the subtree of ST induced by all the already-processed nodes. Additionally, we maintain a stack storing the edges on its rightmost path. Processing $u\in L_a$ requires popping a number of edges from the stack, possibly splitting the topmost edge into two (with one immediately popped as well), and pushing a new edge ending at $u$. Checking if an edge should be popped, and also determining if (and how) an edge should be split, can be implemented with LCA queries on ST, assuming that we maintain pointers to the corresponding nodes of ST. ∎
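The stack-based construction in the proof of Lemma 4 can be sketched as follows. Several simplifications are assumed: the stack stores nodes of the rightmost path rather than edges, LCA queries are answered naively by climbing parent pointers instead of in constant time, the root of the underlying tree is always kept in the result, and the tree is a plain parent array with `parent[v] < v`. All function names are hypothetical.

```python
def depths(parent):
    """Depth of every node, assuming node 0 is the root and parent[v] < v."""
    depth = [0] * len(parent)
    for v in range(1, len(parent)):
        depth[v] = depth[parent[v]] + 1
    return depth


def make_lca(parent, depth):
    """Naive LCA by climbing; a constant-time structure would replace this."""
    def lca(x, y):
        while x != y:
            if depth[x] < depth[y]:
                x, y = y, x
            x = parent[x]  # move the deeper (or an equally deep) node up
        return x
    return lca


def compact_induced_subtree(nodes, parent, root=0):
    """Compact subtree induced by `nodes` (distinct, sorted in preorder).

    Returns a dict mapping each kept node to its parent in the compact tree.
    """
    depth = depths(parent)
    lca = make_lca(parent, depth)
    par = {root: None}
    stack = [root]  # rightmost path of the compact tree built so far
    for u in nodes:
        if u == root:
            continue
        l = lca(stack[-1], u)
        last = None
        while depth[stack[-1]] > depth[l]:  # leave the part of the path below l
            last = stack.pop()
        if stack[-1] != l:  # l splits the edge that led to `last`
            par[l] = par[last]
            par[last] = l
            stack.append(l)
        par[u] = l
        stack.append(u)
    return par


#        0
#      /   \
#     1     2
#    / \     \
#   3   4     7
#      / \
#     5   6
parent = [-1, 0, 0, 1, 1, 4, 4, 2]
print(compact_induced_subtree([3, 5, 6], parent))
```

Every node is pushed once and popped at most once, so apart from the LCA queries the work is linear in the number of processed nodes.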
Having constructed $\textit{ST}_a^{c}$, we need to consider two cases corresponding to $u$ being an explicit or an implicit node of $\textit{ST}_a^{c}$. In the former case, we need to extract the edges outgoing from $u$ in ST such that there is no edge outgoing from the corresponding node in $\textit{ST}_a^{c}$ starting with the same letter $b$, and output $aub$ as a minimal absent word. Assuming that the outgoing edges are sorted by their first letters, this can be easily done in time proportional to the degree of $u$ plus the number of extracted letters. In the latter case, let the implicit node belong to an edge connecting $v$ to $w$ in $\textit{ST}_a^{c}$, and let $v'$ and $w'$ be their corresponding nodes in ST with $v'$ being an ancestor of $w'$. We iterate through all explicit nodes $u$ between $v'$ and $w'$ in ST and for each such node we extract all of its outgoing edges. For each such edge we check if $w'$ belongs to the subtree rooted at its endpoint other than $u$, and if not, extract its first letter $b$ to output $aub$ as a minimal absent word.
The overall time for every letter $a$ can be bounded by the sum of the size of $\textit{ST}_a^{c}$ and the number of generated minimal absent words. Because $\sum_{a\in\Sigma}|L_a|=O(n)$ and the size of $\textit{ST}_a^{c}$ can be bounded by $O(|L_a|)$, the total time complexity is $O(n+|\mathrm{MAW}(\textsf{T})|)$.
The previous algorithm can be used to design an algorithm that outputs all the minimal absent words of an unrooted tree T with $n$ nodes in time $O(n^{2}+|\mathrm{MAW}(\textsf{T})|)$ as follows. For every node $u$ of T, we create a rooted tree $\textsf{T}_u$ by fixing $u$ as the root. Then we merge all trees $\textsf{T}_u$ into a single tree $\textsf{T}'$ of size $O(n^{2})$ by identifying their roots. Finally, we apply Lemma 1 to make $\textsf{T}'$ deterministic and apply our algorithm for rooted trees in $O(n^{2}+|\mathrm{MAW}(\textsf{T})|)$ total time.
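The reduction from unrooted to rooted trees can be sketched as follows: one copy of the tree is created per choice of root, and all copies are hung under a single shared root. The edge-list input, the parent-array output, and the function names are assumptions made for the example; the merged tree is generally not deterministic yet, which is why Lemma 1 is applied afterwards.

```python
def merge_rootings(n, edges):
    """Merge the n rootings of an unrooted tree by identifying their roots.

    edges: list of (u, v, letter) triples over nodes 0..n-1.
    Returns (parent, label) of the merged rooted tree; node 0 is the shared root.
    """
    adj = [[] for _ in range(n)]
    for u, v, c in edges:
        adj[u].append((v, c))
        adj[v].append((u, c))
    parent, label = [-1], [None]
    for s in range(n):
        # The copy of s is the shared root 0; every other node gets a fresh copy.
        stack = [(s, -1, 0)]  # (node of T, previous node of T, its copy)
        while stack:
            x, prev, cx = stack.pop()
            for y, c in adj[x]:
                if y != prev:
                    parent.append(cx)
                    label.append(c)
                    stack.append((y, x, len(parent) - 1))
    return parent, label


def language(parent, label):
    """Words read from every node up to the shared root 0."""
    words = set()
    for v in range(len(parent)):
        w, x = [], v
        while x != 0:
            w.append(label[x])
            x = parent[x]
        words.add("".join(w))
    return words


parent, label = merge_rootings(3, [(0, 1, "a"), (1, 2, "b")])
print(len(parent), sorted(language(parent, label)))
```

The merged tree has $1+n(n-1)$ nodes, and its language is exactly the set of words read along simple paths of the unrooted tree.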
4 Acknowledgments
This research was carried out during a visit of the first author to the Institute of Computer Science of the University of Wrocław, supported by grant CORI-2018-D-D11-010133 of the University of Palermo. The first author is also supported by MIUR project PRIN 2017K7XPAN “Algorithms, Data Structures and Combinatorics for Machine Learning”.
We thank anonymous reviewers for helpful comments.
References
- [1] Y. Almirantis, P. Charalampopoulos, J. Gao, C. S. Iliopoulos, M. Mohamed, S. P. Pissis, and D. Polychronopoulos. On avoided words, absent words, and their application to biological sequence analysis. Algorithms for Molecular Biology, 12(1):5:1–5:12, 2017.
- [2] C. Barton, A. Héliou, L. Mouchard, and S. P. Pissis. Linear-time computation of minimal absent words using suffix array. BMC Bioinformatics, 15:388, 2014.
- [3] M. Béal, M. Crochemore, F. Mignosi, A. Restivo, and M. Sciortino. Computing forbidden words of regular languages. Fundam. Inform., 56(1-2):121–135, 2003.
- [4] M. Béal, F. Fiorenzi, and F. Mignosi. Minimal forbidden patterns of multi-dimensional shifts. IJAC, 15(1):73–93, 2005.
- [5] M. Béal, F. Mignosi, and A. Restivo. Minimal forbidden words and symbolic dynamics. In STACS, volume 1046 of Lecture Notes in Computer Science, pages 555–566. Springer, 1996.
- [6] D. Belazzougui and F. Cunial. A framework for space-efficient string kernels. Algorithmica, 79(3):857–883, 2017.
- [7] D. Belazzougui, F. Cunial, J. Kärkkäinen, and V. Mäkinen. Versatile succinct representations of the bidirectional Burrows-Wheeler transform. In H. L. Bodlaender and G. F. Italiano, editors, Algorithms - ESA 2013 - 21st Annual European Symposium, Sophia Antipolis, France, September 2-4, 2013. Proceedings, volume 8125 of Lecture Notes in Computer Science, pages 133–144. Springer, 2013.
- [8] M. A. Bender and M. Farach-Colton. The LCA problem revisited. In G. H. Gonnet and A. Viola, editors, LATIN 2000: Theoretical Informatics, pages 88–94, Berlin, Heidelberg, 2000. Springer Berlin Heidelberg.
