Minimal Absent Words in Rooted and Unrooted Trees

Gabriele Fici; Paweł Gawrychowski

arXiv:1907.12034·cs.DS·October 31, 2019

TL;DR

This paper extends the concept of minimal absent words to rooted and unrooted trees with labeled edges, providing bounds on their number and algorithms for efficient computation.

Contribution

It introduces the theory of minimal absent words for trees, establishes bounds on their size, and develops output-sensitive algorithms for their computation.

Findings

01

Bounds of O(nσ) for rooted trees and O(n^2σ) for unrooted trees on the size of MAW sets.

02

Algorithms to compute all minimal absent words in output-sensitive time.

03

Bounds are tight and algorithms are efficient for large trees.

Abstract

We extend the theory of minimal absent words to (rooted and unrooted) trees, having edges labeled by letters from an alphabet $\Sigma$ of cardinality $\sigma$. We show that the set $\text{MAW}(T)$ of minimal absent words of a rooted (resp. unrooted) tree $T$ with $n$ nodes has cardinality $O(n\sigma)$ (resp. $O(n^{2}\sigma)$), and we show that these bounds are realized. Then, we exhibit algorithms to compute all minimal absent words in a rooted (resp. unrooted) tree in output-sensitive time $O(n+|\text{MAW}(T)|)$ (resp. $O(n^{2}+|\text{MAW}(T)|)$), assuming an integer alphabet of size polynomial in $n$.

Equations

$$w_{i}=\$1s_{i}\$s_{i}1\$2s_{i}\$s_{i}2\$\cdots\$\sigma s_{i}\$s_{i}\sigma\$$$

$$a_{j}\in(A(u_{1})\cup\dots\cup A(u_{k}))\setminus A(u_{i})$$

Full text

Gabriele Fici, Dipartimento di Matematica e Informatica, Università di Palermo, Italy

Paweł Gawrychowski, Institute of Computer Science, University of Wrocław, Poland

1 Introduction

Minimal absent words (a.k.a. minimal forbidden words or minimal forbidden factors) are a useful combinatorial tool for investigating words (strings). A word $u$ is absent from a word $w$ if $u$ does not occur (as a factor) in $w$ , and it is minimal if all its proper factors occur in $w$ . This definition naturally extends to languages of words closed under taking factors.
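The definition can be checked directly on small inputs. The following is a minimal brute-force sketch, not one of the efficient algorithms cited below; the function names `factors` and `minimal_absent_words` are illustrative, and letters are assumed to be single characters.

```python
def factors(w):
    """Return the set of all factors (substrings) of w, including the empty word."""
    return {w[i:j] for i in range(len(w) + 1) for j in range(i, len(w) + 1)}


def minimal_absent_words(w, alphabet):
    """Brute-force MAW(w): absent words all of whose proper factors occur in w."""
    occ = factors(w)
    maws = set()
    # A single letter is a minimal absent word when it never occurs in w.
    for a in alphabet:
        if a not in occ:
            maws.add(a)
    # Longer candidates have the form a u b where a u and u b both occur in w.
    for au in occ:
        if not au:
            continue
        u = au[1:]
        for b in alphabet:
            if u + b in occ and au + b not in occ:
                maws.add(au + b)
    return maws
```

For instance, on the word `abaab` over the alphabet `{a, b}` this sketch returns `bb`, `aaa`, `bab` and `aaba`. Its running time is polynomial but far from the linear bounds discussed in this paper.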

The theory of minimal absent words has been developed in a series of papers [5, 14, 25, 27, 3] (the reader is pointed to [18] for a survey on these results). Minimal absent words have then found applications in several areas, e.g., data compression [15, 16, 17, 28], on-line pattern matching [13], sequence comparison [10, 11], sequence assembly [20, 26], bioinformatics [9, 31, 19], musical data extraction [12].

Bounds on the number of minimal absent words have been extensively investigated. The upper bound on the number of minimal absent words of a word of length $n$ over an alphabet of size $\sigma$ is $O(n\sigma)$ [14, 27], and this is tight for integer alphabets [10]; in fact, for large alphabets, such as when $\sigma\geq\sqrt{n}$ , this bound is also tight even for minimal absent words having the same length [1].

Several algorithms are known to compute the set of minimal absent words of a word. State-of-the-art algorithms compute all minimal absent words of a word of length $n$ over an alphabet of size $\sigma$ in time $O(n\sigma)$ [14, 2] or in output-sensitive $O(n+|\text{MAW}(w)|)$ time [22, 11] for integer alphabets. Space-efficient data structures based on the Burrows-Wheeler transform can also be applied for this computation [7, 6].

For a finite set of words $P$ over an alphabet of size $\sigma$ , the minimal absent words of the factorial closure of $P$ can be computed in $O(|P|^{2}\sigma)$ [3], where $|P|$ is the sum of the lengths of the words of $P$ . Generalizations of minimal absent words have been considered for circular words [10, 21] and multi-dimensional shifts [4].

In this paper, we extend the theory of minimal absent words to trees. We consider trees with edges labeled by letters from an integer alphabet $\Sigma$ of cardinality $\sigma$ polynomial in $n$ . In the case of a rooted tree T, every node $v$ is associated with a word $\textsf{str}(v)$ , defined as the sequence of edge labels from $v$ to the root. A rooted tree T can therefore be seen as a set of words $L_{\textsf{T}}=\{\textsf{str}(v)\mid v\text{ in }\textsf{T}\}$ , that we call the language of T. If T has $n$ nodes, then $L_{\textsf{T}}$ contains at most $n$ distinct words, each of which has length at most $n$ . We call a rooted tree T deterministic when the edges from a node to its children are labeled by pairwise distinct letters. Throughout the paper we will assume that all rooted trees are deterministic, which can be ensured without loss of generality thanks to the following lemma.

Lemma 1

Given a rooted tree T on $n$ nodes we can construct in $O(n)$ time a deterministic rooted tree $\textsf{T}^{\prime}$ with the same set of corresponding words.

Proof.

The depth of a node of T is its distance from the root. We start with sorting, for every $d=1,2,\ldots$ , the set of nodes $S(d)$ at depth $d$ according to the labels of the edges leading to their parents. This can be done in $O(n)$ total time with counting sort. Then, we construct $\textsf{T}^{\prime}$ by processing $S(0),S(1),S(2),\ldots$ . Assuming that we have already identified, for every node $u\in S(d)$ , its corresponding node $f(u)$ of $\textsf{T}^{\prime}$ , we need to construct and identify the nodes $f(u^{\prime})$ for every $u^{\prime}\in S(d+1)$ . We process all nodes $u^{\prime}\in S(d+1)$ in groups corresponding to the same letter $a$ on the edge leading to their parent (because of the initial sorting we already have these groups available). Denoting by $u$ the parent of $u^{\prime}$ in T, we check if $f(u)$ has been already accessed while processing the group of $a$ , and if so we set $f(u^{\prime})$ to be the already created node of $\textsf{T}^{\prime}$ . Otherwise, we create a new edge outgoing from $f(u)$ to a new node $v$ in $\textsf{T}^{\prime}$ and labeled with $a$ , and set $f(u^{\prime})$ to be $v$ . To check if $f(u)$ has been already accessed while processing the current group (and retrieve the corresponding $f(u^{\prime})$ if this is the case) we simply allocate an array $A$ of size $n$ indexed by nodes of $\textsf{T}^{\prime}$ identified by numbers from $\{1,2,\ldots,n\}$ . For every entry of $A$ we additionally store a timestamp denoting the most recent group for which the corresponding entry has been modified, and increase the timestamp after having processed the current group. ∎
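The merging step of Lemma 1 can be sketched as follows. This is a simplified illustration: it replaces the counting sort and timestamped array of the proof with a hash map keyed by (node of $\textsf{T}^{\prime}$, letter), so it runs in expected rather than worst-case linear time. The parent-array representation (`parent[v]`, with `-1` for the root, and `label[v]` for the edge from `v` to its parent) and the function names are assumptions made for the example.

```python
from collections import deque


def make_deterministic(parent, label):
    """Merge children reached by the same letter; return (parent, label) of T'."""
    n = len(parent)
    children = [[] for _ in range(n)]
    root = None
    for v in range(n):
        if parent[v] < 0:
            root = v
        else:
            children[parent[v]].append(v)
    new_parent, new_label = [-1], [None]   # node 0 of T' is the root
    f = {root: 0}                          # f maps nodes of T to nodes of T'
    child_of = {}                          # (node of T', letter) -> child in T'
    queue = deque([root])
    while queue:                           # process nodes by increasing depth
        u = queue.popleft()
        for c in children[u]:
            key = (f[u], label[c])
            if key not in child_of:        # first time this letter leaves f(u)
                child_of[key] = len(new_parent)
                new_parent.append(f[u])
                new_label.append(label[c])
            f[c] = child_of[key]
            queue.append(c)
    return new_parent, new_label


def language(parent, label):
    """Set of words str(v), read from each node v up to the root."""
    words = set()
    for v in range(len(parent)):
        letters = []
        while parent[v] >= 0:
            letters.append(label[v])
            v = parent[v]
        words.add("".join(letters))
    return words
```

On a tree whose root has two children both reached by the letter `a`, the two children are merged into one node, and `language` returns the same set before and after the transformation.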

One could also define the set of words corresponding to a rooted tree T by considering a set of words from the root to every node $v$ (in the literature this is sometimes called a forward trie, as opposed to a backward trie, cf. [24]). In our context, this distinction is meaningless, as the obtained languages are the same up to reversing all the words.

We say that a word $aub$ , with $a,b\in\Sigma$ , is a minimal absent word of a rooted tree T if $aub$ is not a factor of any word $\textsf{str}(v)$ in $L_{\textsf{T}}$ but there exist words $\textsf{str}(v_{1})$ and $\textsf{str}(v_{2})$ in $L_{\textsf{T}}$ (not necessarily distinct) such that $au$ is a factor of $\textsf{str}(v_{1})$ and $ub$ is a factor of $\textsf{str}(v_{2})$ . That is, the set $\text{MAW}(\textsf{T})$ of minimal absent words of T is the set of minimal absent words of the factorial closure of the language $L_{\textsf{T}}$ . Since any word of length $n$ can be transformed into a unary rooted tree with $n+1$ nodes, some of the properties of minimal absent words for usual words can be transferred to rooted trees. Indeed, rooted trees are a strict generalization of words.
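As a sanity check of this definition, the set $\text{MAW}(\textsf{T})$ can be computed by brute force from the factorial closure of $L_{\textsf{T}}$. The sketch below assumes a parent-array representation of the tree (`parent[v]` is `-1` for the root, `label[v]` is the letter on the edge from `v` to its parent) and single-character letters; the names `tree_language` and `maw_rooted` are illustrative, and the procedure is quadratic or worse, unlike the algorithm of Section 3.

```python
def tree_language(parent, label):
    """Return L_T: the words str(v), read from each node v up to the root."""
    words = set()
    for v in range(len(parent)):
        letters = []
        while parent[v] >= 0:
            letters.append(label[v])
            v = parent[v]
        words.add("".join(letters))
    return words


def maw_rooted(parent, label, alphabet):
    """Brute-force MAW(T): words a u b with a u and u b factors but a u b absent."""
    fac = set()
    for w in tree_language(parent, label):
        for i in range(len(w) + 1):
            for j in range(i, len(w) + 1):
                fac.add(w[i:j])            # factorial closure of L_T
    return {au + b
            for au in fac if au
            for b in alphabet
            if au[1:] + b in fac and au + b not in fac}
```

A unary tree spelling `abaab` from its deepest node to the root yields the same minimal absent words as the word `abaab` itself, which illustrates the remark that rooted trees generalize words.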

For unrooted trees, the definition of minimal absent words is analogous: We identify an unrooted tree T with the language of words $L(\textsf{T})$ corresponding to all (concatenations of labels of) simple paths that can be read in T from any of its nodes. The language $L(\textsf{T})$ contains $O(n^{2})$ words, each of which has length at most $n$ . We therefore define the set $\text{MAW}(\textsf{T})$ of minimal absent words of T as the set of minimal absent words of the language $L(\textsf{T})$ , which in this case is already closed under taking factors by definition.

Our results.

We prove that for any rooted tree with $n$ nodes there are $O(n\sigma)$ minimal absent words, and we show that this bound is tight. For unrooted trees, we prove that the previous bound becomes $O(n^{2}\sigma)$ , and we give an explicit construction that achieves this bound. We also consider the case of minimal absent words of fixed length and generalize a previously-known construction.

Furthermore, we present an algorithm that computes all the minimal absent words in a rooted tree T with $n$ nodes in output-sensitive time $O(n+|\text{MAW}(\textsf{T})|)$ . This also yields an algorithm that computes all the minimal absent words in an unrooted tree T with $n$ nodes in time $O(n^{2}+|\text{MAW}(\textsf{T})|)$ . Note that while it is plausible that an efficient algorithm could have been designed, as in the case of words, from a DAWG [22], the size of the DAWG of a backward/forward tree is superlinear [24], so it is not immediately clear if such an approach would lead to an optimal algorithm. Excluding the space necessary to store all the results, our algorithms need $O(n)$ and $O(n^{2})$ space, respectively.

Our algorithms are designed in the word-RAM model with $\Omega(\log n)$ -bit words.

2 Bounds on the number of minimal absent words

Let T be a rooted tree with $n$ nodes and edges labeled by letters from an integer alphabet $\Sigma$ of cardinality $\sigma$ polynomial in $n$ . Let the language of T be $L_{\textsf{T}}=\{\textsf{str}(v)\mid v\text{ in }\textsf{T}\}$ , where $\textsf{str}(v)$ is the sequence of edge labels from node $v$ to the root.

For convenience, we add a new root to T and an edge labeled by a new letter $\$$ not belonging to $\Sigma$ from the new root to the old root. This corresponds to appending $\$$ at the end of each word of $L_{\textsf{T}}$ . We then arrange all the words of $L_{\textsf{T}}$ in a trie. Each node $u$ of this trie corresponds to the word obtained by concatenating the edge labels from the root of the trie to node $u$ , so in this paper we will implicitly identify a node of the trie with the corresponding word in the set of prefixes of $L_{\textsf{T}}$ . Following a standard approach, if we compact this trie by collapsing maximal chains of edges in which every inner node has exactly one child, so that edges become labeled by words, we obtain the suffix tree ST of T. The nodes in ST (the branching nodes) are called explicit nodes, while the nodes of the trie that have been collapsed (the non-branching nodes) are called implicit. Because $\$$ does not belong to the original alphabet, the leaves of ST are in one-to-one correspondence with the nodes of T.

A word $aub$ , with $a,b\in\Sigma$ , is a minimal absent word for T if it is a minimal absent word for the factorial closure of $L_{\textsf{T}}$ , that is, if both $au$ and $ub$ but not $aub$ are factors of some words in $L_{\textsf{T}}$ . The set of minimal absent words of T is denoted by $\text{MAW}(\textsf{T})$ .

If $aub\in\text{MAW}(\textsf{T})$ , then $au$ occurs as a factor in some word of $L_{\textsf{T}}$ but never followed by letter $b$ , hence there exists a letter $b^{\prime}\in\Sigma\cup\{\$\}$ different from $b$ such that $ub$ and $ub^{\prime}$ can be read in ST spelled from the root (possibly terminating in an implicit node). This implies that $u$ corresponds to an explicit node in ST, and $b$ is the first letter on one of its outgoing edges. Consequently, $ub$ can be identified with an edge of ST, so the number of minimal absent words of T is upper-bounded by the product of $\sigma$ and the number of edges of ST. Since the leaves of ST correspond to the nodes of T and its explicit inner nodes are branching, ST has $O(n)$ edges.

Theorem 2.1

The number of minimal absent words of a rooted tree with $n$ nodes whose edges are labeled by letters from an alphabet of size $\sigma$ is $O(n\sigma)$ .

Therefore, the same upper bound that holds for words also holds for rooted trees. As a consequence, we have that all known upper bounds for words, and constructions that realize them, are still valid for rooted trees.

In particular, one question that has been studied is whether the upper bound $O(n\sigma)$ is still tight when one considers minimal absent words of a fixed length. Almirantis et al. [1, Lemma 2] showed that the upper bound $O(n\sigma)$ for a fixed length of minimal absent words is tight if $\sqrt{n}<\sigma\leq n$ . Actually, they showed that it is possible to construct words of any length $n$ , with $\sigma\leq n\leq\sigma(\sigma-1)$ , having $\Omega(n\sigma)$ minimal absent words of length $3$ . We now give a construction that generalizes this result.

Let $\Sigma=\{1,2,\ldots,\sigma\}$ . For every $n$ , let $k>1$ be such that $\sigma^{k}\leq n<\sigma^{k+1}$ . Let $\Sigma^{k}=\{s_{1},s_{2},\ldots,s_{\sigma^{k}}\}$ . For every $1\leq i\leq\sigma^{k}$ we define the word

$$w_{i}=\$1s_{i}\$s_{i}1\$2s_{i}\$s_{i}2\$\cdots\$\sigma s_{i}\$s_{i}\sigma\$,$$

where $\$$ is a new symbol not belonging to $\Sigma$ . The length of each word $w_{i}$ is $2\sigma(k+2)+1$ , which is smaller than $n$ up to excluding small cases (the reader may verify that for $k=2$ , $|w_{i}|\leq\sigma^{k}$ as soon as $\sigma\geq 9$ ; for $k>2$ , $|w_{i}|\leq\sigma^{k}$ as soon as $\sigma+k\geq 7$ ).

Let $\ell=\lfloor{n/|w_{i}|}\rfloor$ and set $w=w_{1}w_{2}\cdots w_{\ell}$ , so that $|w|>n/2$ . We have that $as_{i}b$ is a minimal absent word of $w$ for every $a,b\in\Sigma$ and $1\leq i\leq\ell$ . So, $w$ has length $\Theta(k\sigma\ell)$ and there are $\Theta(\sigma^{2}\ell)$ minimal absent words of $w$ of length $k+2$ .

Thus, we have proved the following theorem.

Theorem 2.2

A word of length $n$ over an alphabet of size $\sigma$ can have $\Omega(n\sigma/\log_{\sigma}n)$ minimal absent words all of the same length.
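The count behind Theorem 2.2 can be spelled out from the quantities defined in the construction above (outside the small cases excluded there, so that $\ell\geq 1$ ); no new symbols are introduced:

$$|w_{i}|=2\sigma(k+2)+1=\Theta(\sigma k),\qquad \ell=\lfloor n/|w_{i}|\rfloor=\Theta\!\left(\frac{n}{\sigma k}\right),$$

$$|\{as_{i}b \mid a,b\in\Sigma,\ 1\leq i\leq\ell\}|=\sigma^{2}\ell=\Theta\!\left(\frac{n\sigma}{k}\right),\qquad \sigma^{k}\leq n<\sigma^{k+1}\ \Rightarrow\ k=\lfloor\log_{\sigma}n\rfloor.$$

Since $\ell\leq n/|w_{i}|<\sigma^{k}$ , the words $s_{1},\ldots,s_{\ell}$ are pairwise distinct, so the $\sigma^{2}\ell$ words $as_{i}b$ are distinct and all have length $k+2$ . Padding $w$ to length exactly $n$ with the symbol $\$$ does not make any of them occur.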

Observe that for $\sqrt{n}<\sigma\leq n$ , $\log_{\sigma}n=\Theta(1)$ , therefore Theorem 2.2 strictly generalizes Almirantis et al.’s result.

Let now T be an unrooted tree. The number of distinct simple paths in T is $O(n^{2})$ . Since each minimal absent word $aub$ is uniquely described by a pair $(au,b)$ such that $au$ is a simple path in T and $b$ is a letter, we have that the number of minimal absent words of T is upper-bounded by $O(n^{2}\sigma)$ .

Theorem 2.3

The number of minimal absent words of an unrooted tree with $n$ nodes whose edges are labeled by letters from an alphabet of size $\sigma$ is $O(n^{2}\sigma)$ .

We now provide an example of an unrooted tree realizing this bound. Let $\Sigma=\{0,1,\ldots,s\}$ . Our unrooted tree T is built as follows:

• We first build a sequence of $N+1$ nodes such that every other node is connected to $s$ terminal nodes with edges labeled by $1,2,\ldots,s$ and is connected to the next node of the sequence with an edge labeled by [math];

• Then, we attach to each of the last nodes of the previous sequence $s$ simple paths composed of $N$ nodes with edges labeled by [math].

See Figure 1 for an illustration.

In total, T has $(s+1)N+s(N+1)+1$ nodes. We therefore set $n=(s+1)N+s(N+1)+1$ , so that $n=\Theta(sN)$ .

It is readily verified that for every $a,b,c$ in $\Sigma\setminus\{0\}$ and for every $0<j,k\leq N$ , there is a minimal absent word of the form $a0^{j}b0^{k}c$ (the prefix $a0^{j}b0^{k}$ can be found reading from the left part to the right part of the figure, while the suffix $0^{j}b0^{k}c$ can be found reading from the right part to the left part, the letter $b$ being one of the labels of the edges joining the left and the right part). Hence, the number of minimal absent words of T is $\Omega(s^{3}N^{2})=\Omega(n^{2}\sigma)$ .
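The last equality can be checked term by term, using $\sigma=s+1$ and $n=\Theta(sN)$ from the construction:

$$|\{a0^{j}b0^{k}c \mid a,b,c\in\Sigma\setminus\{0\},\ 0<j,k\leq N\}|=s^{3}N^{2},\qquad n^{2}\sigma=\Theta\big((sN)^{2}\cdot s\big)=\Theta(s^{3}N^{2}).$$

Distinct tuples $(a,j,b,k,c)$ give distinct words because $a$ , $b$ and $c$ are all different from $0$ .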

Remark 1

The previous construction can be modified in such a way that edges adjacent to a node have distinct labels, keeping the same bound on the number of minimal absent words.

3 Algorithms for computing minimal absent words

We now present an algorithm that computes the set $\text{MAW}(\textsf{T})$ of all minimal absent words of a rooted tree T with $n$ nodes in output-sensitive time $O(n+|\text{MAW}(\textsf{T})|)$ .

We construct the suffix tree ST of T in time $O(n)$ [30]. Recall that the leaves of ST are in one-to-one correspondence with the nodes of T and we can assume that every node $u$ of T stores a pointer to the leaf of ST corresponding to $\textsf{str}(u)$ .

Definition 1

For every (implicit or explicit) node $u$ of ST, we define the set $A(u)$ as the set of all letters $a\in\Sigma$ such that $au$ can be spelled from the root of ST, i.e., there exists a node $v$ of T such that $\textsf{str}(v)=auz$ for some (possibly empty) word $z$ .

As already noted before, if $aub$ is a minimal absent word of T, then $au$ occurs as a factor in some word of $L_{\textsf{T}}$ followed by a letter $b^{\prime}\in\Sigma\cup\{\$\}$ different from $b$ , hence $u$ is an explicit node of ST.

Lemma 2

Let $u$ be an explicit node of the suffix tree ST of the tree T. Let $u_{1},u_{2},\ldots,u_{k}$ be the children of $u$ in the non-compacted trie from which we obtained ST, and let $b_{1},b_{2},\ldots,b_{k}$ be the labels of the corresponding edges. Then, for every $1\leq i\leq k$ and every letter

$$a_{j}\in(A(u_{1})\cup\dots\cup A(u_{k}))\setminus A(u_{i}),$$

the word $a_{j}ub_{i}$ is a minimal absent word of T.

Conversely, every minimal absent word of T is of the form $a_{j}ub_{i}$ described above.

Proof.

Since $a_{j}$ does not belong to $A(u_{i})$ , by definition the word $a_{j}ub_{i}$ is not a factor of any word in $L_{\textsf{T}}$ , but there exists $\ell\neq i$ such that $a_{j}\in A(u_{\ell})$ , that is, $a_{j}ub_{\ell}$ is a factor of a word in $L_{\textsf{T}}$ . Hence, $a_{j}u$ is a factor of a word in $L_{\textsf{T}}$ . Since $ub_{i}$ is also a factor of a word in $L_{\textsf{T}}$ by construction, we have that $a_{j}ub_{i}$ is a minimal absent word of T.

Conversely, if $a_{j}ub_{i}$ is a minimal absent word of T, then $u$ occurs as a factor in some word of $L_{\textsf{T}}$ followed by different letters in $\Sigma\cup\{\$\}$ , hence it corresponds to an explicit node in ST, so all minimal absent words of T are found in this way. ∎

Definition 2

For every leaf $u$ of ST we define the set $B(u)$ as the set of all letters $a\in\Sigma$ such that $au=\textsf{str}(v)$ for some node $v$ in T.

Lemma 3

For every (implicit or explicit) node $u$ of ST, we have $A(u)=\bigcup\{B(u^{\prime})\mid u^{\prime}\text{ is a leaf in the subtree of ST rooted at }u\}$ .

Proof.

Let $u^{\prime}$ be a leaf in the subtree of ST rooted at $u$ . Thus, the word $u$ is a prefix of the word $u^{\prime}$ , i.e., $u^{\prime}=uz$ for some word $z$ . By definition, $B(u^{\prime})$ is the set of all letters $a\in\Sigma$ such that $au^{\prime}=\textsf{str}(v^{\prime})$ for some node $v^{\prime}$ in T. That is, the set of all letters $a\in\Sigma$ such that $au^{\prime}=auz$ is a word in $L_{\textsf{T}}$ . On the other hand, by definition, $A(u)$ is the set of all letters $a\in\Sigma$ such that $\textsf{str}(v)=auz$ for some node $v$ of T and word $z$ . That is, the set of all letters $a\in\Sigma$ such that $auz$ is a word in $L_{\textsf{T}}$ for some word $z$ . ∎
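Lemmas 2 and 3 together already give a correct, though not output-sensitive, way to enumerate $\text{MAW}(\textsf{T})$ . The sketch below works on the non-compacted trie, represents each trie node by the word it spells, and fills $A(u)$ as the union of the sets $B$ of the leaves below $u$ . The parent-array tree representation, the use of the character `$` as terminator, and the name `maw_via_lemmas` are assumptions of the example; the quadratic-size dictionaries are a simplification, not the data structures used in the algorithm described next.

```python
from collections import defaultdict


def maw_via_lemmas(parent, label):
    """Enumerate MAW(T) using A(u) (Lemma 3) and the rule of Lemma 2."""
    n = len(parent)
    kids = [[] for _ in range(n)]
    for v in range(n):
        if parent[v] >= 0:
            kids[parent[v]].append(v)

    def str_of(v):
        letters = []
        while parent[v] >= 0:
            letters.append(label[v])
            v = parent[v]
        return "".join(letters)

    A = defaultdict(set)               # trie node (as a word) -> set A(u)
    trie_children = defaultdict(set)   # trie node -> letters on outgoing edges
    for v in range(n):
        leaf = str_of(v) + "$"         # leaf of the trie corresponding to v
        B = {label[c] for c in kids[v]}  # letters a with a.str(v) = str(child)
        for i in range(len(leaf) + 1):
            A[leaf[:i]] |= B           # Lemma 3: union over leaves below
            if i > 0:
                trie_children[leaf[:i - 1]].add(leaf[i - 1])

    maws = set()
    for u, letters in trie_children.items():
        for b in letters:
            if b == "$":               # b must be a letter of the alphabet
                continue
            for a in A[u] - A[u + b]:  # Lemma 2: a in A(u) but not in A(u b)
                maws.add(a + u + b)
    return maws
```

Nodes with a single child contribute nothing, because then $A(u)$ equals $A$ of that child, which matches the observation that only explicit nodes of ST matter.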

We now show how to compute, in time proportional to the output size, the set $\text{MAW}(\textsf{T})$ .

We start with creating, for every letter $a\in\Sigma$ , a list $L(a)$ of all leaves $u$ such that $a\in B(u)$ sorted in preorder. The lists can be obtained in linear time by traversing all the non-root nodes $v\in\textsf{T}$ , following the edge labeled by $a$ from $v$ to its parent $v^{\prime}$ , then following the pointer from $v^{\prime}$ to the leaf $v^{\prime\prime}$ of ST corresponding to $\textsf{str}(v^{\prime})$ , and adding $v^{\prime\prime}$ to $L(a)$ . Because the preorder numbers are from $[n]=\{1,2,\ldots,n\}$ , the lists can then be sorted in linear time with counting sort.

Now we iterate over all letters $a\in\Sigma$ . Due to Lemma 2, the goal is to extract all explicit nodes $u\in\textit{ST}$ such that, for some child $u_{i}$ of $u$ with $b_{i}$ being the first letter on the edge from $u$ to $u_{i}$ , the word $aub_{i}$ is a minimal absent word. By Lemma 3, this is equivalent to $u$ having a descendant $u^{\prime}\in L(a)$ (where possibly $u=u^{\prime}$ ) and $u_{i}$ not having any such descendant. This suggests that we should work with the subtree of ST, denoted $\textit{ST}(a)$ , induced by all leaves $v\in L(a)$ . Formally, $u\in\textit{ST}(a)$ when $u^{\prime}\in L(a)$ for some leaf $u^{\prime}$ in the subtree of $u$ . Even though ST does not contain nodes with just one child, this is no longer the case for $\textit{ST}(a)$ . Thus, we actually work with its compact version, for which we keep the notation $\textit{ST}(a)$ . Every node of $\textit{ST}(a)$ stores a pointer to its corresponding node of ST. Assuming that ST has been preprocessed for constant-time Lowest Common Ancestor queries (which can be done in linear time and space [29, 8]), we can construct $\textit{ST}(a)$ efficiently due to the following lemma.

Lemma 4

Given $L(a)$ , we can construct $\textit{ST}(a)$ in $O(|L(a)|)$ time.

Proof.

The procedure follows the general idea used in the well-known linear time procedure for creating a Cartesian tree [23]. We process the nodes $u\in L(a)$ in preorder and maintain a compact version of the subtree of ST induced by all the already-processed nodes. Additionally, we maintain a stack storing the edges on its rightmost path. Processing $u\in L(a)$ requires popping a number of edges from the stack, possibly splitting the topmost edge into two (with one immediately popped as well), and pushing a new edge ending at $u$ . Checking if an edge should be popped, and also determining if (and how) an edge should be split, can be implemented with LCA queries on ST, assuming that we maintain pointers to the corresponding nodes of ST. ∎
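The stack-based construction can be sketched as follows. This is a variation under stated assumptions rather than the exact procedure of the proof: the stack holds the nodes (not the edges) of the rightmost path, the resulting tree is rooted at the lowest common ancestor of the selected leaves rather than at the root of ST, and the LCA oracle `make_lca` is a naive climbing routine standing in for a constant-time structure. The tree is given by hypothetical `parent` and `depth` arrays.

```python
def make_lca(parent, depth):
    """Naive LCA by climbing; a stand-in for a constant-time LCA structure."""
    def lca(x, y):
        while x != y:
            if depth[x] < depth[y]:
                y = parent[y]
            else:
                x = parent[x]
        return x
    return lca


def induced_compact_tree(leaves, lca, depth):
    """Given leaves in preorder, return {node: parent} of the compact induced tree."""
    stack = [leaves[0]]                 # nodes on the current rightmost path
    vparent = {leaves[0]: None}
    for u in leaves[1:]:
        w = lca(stack[-1], u)           # branching point for the new leaf
        last = None
        while stack and depth[stack[-1]] > depth[w]:
            last = stack.pop()          # these nodes leave the rightmost path
        if not stack or stack[-1] != w:
            # w is new: it splits the edge that used to enter `last`
            vparent[w] = stack[-1] if stack else None
            stack.append(w)
        if last is not None:
            vparent[last] = w
        vparent[u] = w
        stack.append(u)
    return vparent
```

Each leaf is pushed once and each node is popped at most once, so with constant-time LCA queries the number of stack operations is linear in the number of leaves supplied.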

Having constructed $\textit{ST}(a)$ , we need to consider two cases corresponding to $u$ being an explicit or an implicit node of $\textit{ST}(a)$ . In the former case, we need to extract the edges outgoing from $u$ in ST such that there is no edge outgoing from the corresponding node in $\textit{ST}(a)$ starting with the same letter $b$ , and output $aub$ as a minimal absent word. Assuming that the outgoing edges are sorted by their first letters, this can be easily done in time proportional to the degree of $u$ plus the number of extracted letters. In the latter case, let the implicit node belong to an edge connecting $u$ to $v$ in $\textit{ST}(a)$ , and let $u^{\prime}$ and $v^{\prime}$ be their corresponding nodes in ST with $u^{\prime}$ being an ancestor of $v^{\prime}$ . We iterate through all explicit nodes between $u^{\prime}$ and $v^{\prime}$ in ST and for each such node we extract all of its outgoing edges. For each such edge we check if $v^{\prime}$ belongs to the subtree rooted at its endpoint other than $u$ , and if not, extract its first letter $b$ to output $aub$ as a minimal absent word.

The overall time for every letter $a\in\Sigma$ can be bounded by the sum of the size of $\textit{ST}(a)$ and the number of generated minimal absent words. Because $\sum_{a\in\Sigma}|L(a)|=O(n)$ and the size of $\textit{ST}(a)$ can be bounded by $O(|L(a)|)$ , the total time complexity is $O(n+|\text{MAW}(\textsf{T})|)$ .

The previous algorithm can be used to design an algorithm that outputs all the minimal absent words of an unrooted tree T with $n$ nodes in time $O(n^{2}+|\text{MAW}(\textsf{T})|)$ as follows. For every node $u$ of T, we create a rooted tree $\textsf{T}_{u}$ by fixing $u$ as the root. Then we merge all trees $\textsf{T}_{u}$ into a single tree T of size $O(n^{2})$ by identifying their roots. Finally, we apply Lemma 1 to make T deterministic and apply our algorithm for rooted trees in $O(n^{2})$ total time.
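The reduction from unrooted to rooted trees can be illustrated on small inputs. In the sketch below the trees $\textsf{T}_{u}$ are merged at their roots and made deterministic in a single pass, using a hash map keyed by (merged node, letter) in place of the counting-sort procedure of Lemma 1, and a brute-force enumeration stands in for the output-sensitive rooted algorithm. The edge-list input format and the names `merged_rooted_tree` and `maw_unrooted` are assumptions of the example.

```python
def merged_rooted_tree(n, edges):
    """Root T at every node, merge the rooted trees at their roots, determinize."""
    adj = [[] for _ in range(n)]
    for x, y, c in edges:
        adj[x].append((y, c))
        adj[y].append((x, c))
    parent, label = [-1], [None]        # node 0 is the common root
    child_of = {}                       # (merged node, letter) -> merged child
    for root in range(n):
        stack = [(root, -1, 0)]         # (node of T, previous node, merged node)
        while stack:
            v, prev, t = stack.pop()
            for y, c in adj[v]:
                if y == prev:           # in a tree this keeps the path simple
                    continue
                key = (t, c)
                if key not in child_of:
                    child_of[key] = len(parent)
                    parent.append(t)
                    label.append(c)
                stack.append((y, v, child_of[key]))
    return parent, label


def maw_unrooted(n, edges, alphabet):
    """Brute-force MAW of the unrooted tree via the merged rooted tree."""
    parent, label = merged_rooted_tree(n, edges)
    lang = set()
    for v in range(len(parent)):
        letters = []
        while parent[v] >= 0:
            letters.append(label[v])
            v = parent[v]
        lang.add("".join(letters))      # L(T) is already closed under factors
    return {au + b
            for au in lang if au
            for b in alphabet
            if au[1:] + b in lang and au + b not in lang}
```

For a path with three nodes whose two edges are labeled `a` and `b`, the merged deterministic tree has five nodes and the sketch reports the minimal absent words `aa`, `bb`, `aba` and `bab`.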

4 Acknowledgments

This research was carried out during a visit of the first author to the Institute of Computer Science of the University of Wrocław, supported by grant CORI-2018-D-D11-010133 of the University of Palermo. The first author is also supported by MIUR project PRIN 2017K7XPAN “Algorithms, Data Structures and Combinatorics for Machine Learning”.

We thank anonymous reviewers for helpful comments.

Bibliography

[1] Y. Almirantis, P. Charalampopoulos, J. Gao, C. S. Iliopoulos, M. Mohamed, S. P. Pissis, and D. Polychronopoulos. On avoided words, absent words, and their application to biological sequence analysis. Algorithms for Molecular Biology, 12(1):5:1–5:12, 2017.
[2] C. Barton, A. Héliou, L. Mouchard, and S. P. Pissis. Linear-time computation of minimal absent words using suffix array. BMC Bioinformatics, 15:388, 2014.
[3] M. Béal, M. Crochemore, F. Mignosi, A. Restivo, and M. Sciortino. Computing forbidden words of regular languages. Fundam. Inform., 56(1-2):121–135, 2003.
[4] M. Béal, F. Fiorenzi, and F. Mignosi. Minimal forbidden patterns of multi-dimensional shifts. IJAC, 15(1):73–93, 2005.
[5] M. Béal, F. Mignosi, and A. Restivo. Minimal forbidden words and symbolic dynamics. In STACS, volume 1046 of Lecture Notes in Computer Science, pages 555–566. Springer, 1996.
[6] D. Belazzougui and F. Cunial. A framework for space-efficient string kernels. Algorithmica, 79(3):857–883, 2017.
[7] D. Belazzougui, F. Cunial, J. Kärkkäinen, and V. Mäkinen. Versatile succinct representations of the bidirectional Burrows-Wheeler transform. In H. L. Bodlaender and G. F. Italiano, editors, Algorithms - ESA 2013 - 21st Annual European Symposium, Sophia Antipolis, France, September 2-4, 2013. Proceedings, volume 8125 of Lecture Notes in Computer Science, pages 133–144. Springer, 2013.
[8] M. A. Bender and M. Farach-Colton. The LCA problem revisited. In G. H. Gonnet and A. Viola, editors, LATIN 2000: Theoretical Informatics, pages 88–94, Berlin, Heidelberg, 2000. Springer Berlin Heidelberg.