Fully-functional bidirectional Burrows-Wheeler indexes

Fabio Cunial; Djamal Belazzougui

arXiv:1901.10165·cs.DS·June 11, 2019


TL;DR

This paper introduces fully-functional bidirectional Burrows-Wheeler indexes supporting constant-time character addition and removal from any substring, enabling efficient, small-space, variable-order de Bruijn graph traversal.

Contribution

It presents new bidirectional BWT indexes that support both addition and removal of characters in constant or near-constant time, improving flexibility over previous structures.

Findings

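01 Supports constant-time addition and removal of one character on either side of any substring

02 A second index uses a number of words proportional to the number of left and right extensions of the maximal repeats of T, with O(log log |T|)-time operations

03 Enables small-space, variable-order de Bruijn graph traversal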


Djamal Belazzougui

CAPA, DTISI, Centre de Recherche sur l’Information Scientifique et Technique

Algiers, Algeria.


and

Fabio Cunial

Max Planck Institute for Molecular Cell Biology and Genetics (MPI-CBG)

Center for Systems Biology Dresden (CSBD)

Dresden, Germany.


Abstract.

Given a string $T$ on an alphabet of size $\sigma$ , we describe a bidirectional Burrows-Wheeler index that takes $O(|T|\log{\sigma})$ bits of space, and that supports the addition and removal of one character, on the left or right side of any substring of $T$ , in constant time. Previously known data structures that used the same space allowed constant-time addition to any substring of $T$ , but they could support removal only from specific substrings of $T$ . We also describe an index that supports bidirectional addition and removal in $O(\log{\log{|T|}})$ time, and that takes a number of words proportional to the number of left and right extensions of the maximal repeats of $T$ . We use such fully-functional indexes to implement bidirectional, frequency-aware, variable-order de Bruijn graphs with no upper bound on their order, and supporting natural criteria for increasing and decreasing the order during traversal.

1. Introduction

A bidirectional index on a string $T$ is a data structure that represents any substring $W$ of $T$ as a constant-size descriptor that recapitulates the set of all starting positions of $W$ in $T$ , and the set of all ending positions of $W$ in $T$ . Such a representation allows extending $W$ with a character in both directions, enumerating the distinct characters that occur after $W$ in both directions, and switching direction during extension. All existing bidirectional indexes can be seen as updating positions in the suffix tree of $T$ and in the suffix tree of the reverse of $T$ , either literally, as in the affix tree [30, 49], or in compact representations, like the affix array [50] and the bidirectional Burrows-Wheeler transform (BWT) [47]. Synchronous bidirectional indexes maintain a position in both trees at every extension step, whereas asynchronous indexes maintain a position in just one tree, and compute the position in the other only when the user needs to change direction [18]. Applications of bidirectional indexes to bioinformatics, like read mapping with mismatches and searching for RNA secondary structures, have used until now the ability of bidirectional indexes to add characters both to the left and to the right of a string (an operation called extension: see e.g. [25, 28, 34, 45, 47, 50] for a small sampler), whereas removing characters from the left and from the right (called contraction) has only been conjectured to be useful [13, 18], and it has been supported efficiently just for right-maximal and left-maximal substrings of $T$ , respectively (defined in Section 2), or for strings that occur just once in $T$ , for which the implementation is straightforward (see e.g. [11, 38]).

In this paper we describe a simple method for removing characters from the left or from the right of any substring of $T$ , based just on the ability to measure the length of the maximal repeats of $T$ (defined in Section 2). Using the recent observation that all such lengths can be represented in $O(|T|)$ bits of space [7], we show that bidirectional contraction can be supported in constant time with the bidirectional BWT index described in [11], within the same space budget and without changing the complexity of its construction. Our contraction algorithm can also be implemented on top of an existing representation of the suffix tree, based on the Compact Directed Acyclic Word Graph (CDAWG), that takes a number of words proportional just to the number of left and right extensions of the maximal repeats of $T$ [8]: this yields an index that supports, in the same asymptotic space, bidirectional extension and contraction of any substring of $T$ in $O(\log{\log{|T|}})$ time.

Having both bidirectional extension and contraction enables several applications, among which a de Bruijn graph that stores the frequency of its $k$ -mers, allows for bidirectional navigation, and supports any value of $k$ , as well as increasing and decreasing the value of $k$ , with no limit on the maximum $k$ allowed. We call such a data structure an infinite-order de Bruijn graph, and we describe an implementation that takes $O(|T|\log{\sigma})$ bits of space (where $\sigma$ is the size of the alphabet), and that supports all operations in constant time, as well as another implementation that takes a number of words proportional to the number of left and right extensions of the maximal repeats of $T$ , and that supports all operations in $O(\log{\log{|T|}})$ time. The latter representation establishes a connection between de Bruijn graphs and CDAWGs that was not known before. Our query times are comparable to those of the variable-order, bidirectional representation described in [13], which supports navigation and changing order in $O(\log{K})$ time (assuming constant $\sigma$ ), but is frequency-oblivious and requires a maximum order $K$ to be specified during construction. This competitor has the advantage of taking just $O(m\log{K})$ bits of space, where $m$ is the number of distinct $K$ -mers, and of allowing the user to specify by how much the order should be changed in each query (the changes in order supported by our index are detailed in Sections 3 and 4). The variable-order representation described in [22] takes constant time (assuming constant $\sigma$ ) to implement changes in order that are similar to those supported by our index, and uses just $O(m)$ bits of space; however, it is unidirectional and frequency-oblivious, and it again requires a maximum $K$ to be known at construction time.

We conjecture that a de Bruijn graph representation based on the CDAWG might be useful for assembling the recently introduced PacBio CCS reads, which have the same 2% error rate as Illumina short reads but an average length of 15 kilobases (see e.g. [51]). Such read sets contain long exact repeats, of length up to ten thousand, so it might be desirable to set $k$ to large values and to decrease it dynamically, down to a minimum value $\tau$ . Moreover, most maximal repeats are short (Figure 1, bottom right), and we can remove from the CDAWG all maximal repeats shorter than $\tau$ , and all arcs adjacent to them, while still being able to represent all de Bruijn graphs of order at least $\tau$ (see Section 4). For practical values of $k$ , the number of nodes and arcs in such a pruned CDAWG grows more slowly than the number of distinct $k$ -mers (Figure 1, top right; reads from the Genome in a Bottle consortium, ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/PacBio_CCS_15kb/), suggesting that our data structure might be competitive in space with the state of the art, whose size is proportional to the number of $k$ -mers for a specific value of $k$ . The same observation applies to repetitive datasets: for example, the de Bruijn graph of a set of individuals from the same species has applications in population genomics, and the de Bruijn graph of a set of genomes from related species is used in comparative genomics [35, 36]. In Figure 1, bottom left, we experiment with the concatenation of assemblies hg16, hg17, hg18, hg19 and hg38 of the human genome from the UCSC Genome Browser (http://hgdownload.soe.ucsc.edu/downloads.html#human; a benchmark dataset from [3, 36]), and we observe exact repeats of length up to 489 million. Our data structure might also be useful with noisy long reads after error correction. Even in short-read Illumina datasets, the number of maximal repeats and of their extensions after pruning is just a small multiple of the number of distinct $k$ -mers (Figure 1, top left; reads from the Illumina Platinum project, https://www.ebi.ac.uk/ena/data/view/PRJEB3381, run ERR194146, file ERR194146_1.fastq.gz, read length 101).

Finally, recall that our de Bruijn graph representations allow access to the frequency of a node or arc: this might be useful for avoiding repetitive regions during assembly, or for reconstructing only those [26], for assembling metagenomes with non-uniform sequencing depths [29], or for inferring transcripts with different expression levels [42].

2. Preliminaries

2.1. Strings

Let $\Sigma=[1..\sigma]$ be an integer alphabet, let $\#=0$ be a separator not in $\Sigma$ , and let $T\in[1..\sigma]^{n-1}$ be a string. We denote by $\overline{W}$ the reverse of a string $W$ , i.e. string $W$ written from right to left, and we call $W$ a k-mer iff $|W|=k$ . We denote by $f_{T}(W)$ the number of (possibly overlapping) occurrences of a string $W$ in the circular version of $T$ . A repeat $W$ is a string that satisfies $f_{T}(W)>1$ . We denote by $\Sigma^{\ell}_{T}(W)$ the set of left-extensions of $W$ , i.e. the set of characters $\{a\in[0..\sigma]:f_{T}(aW)>0\}$ . Symmetrically, we denote by $\Sigma^{r}_{T}(W)$ the set of right-extensions of $W$ , i.e. the set of characters $\{b\in[0..\sigma]:f_{T}(Wb)>0\}$ . A repeat $W$ is right-maximal (respectively, left-maximal) iff $|\Sigma^{r}_{T}(W)|>1$ (respectively, iff $|\Sigma^{\ell}_{T}(W)|>1$ ). It is well-known that $T$ can have at most $n-1$ right-maximal substrings and at most $n-1$ left-maximal substrings. A maximal repeat of $T$ (called balanced substring in [50]) is a repeat that is both left- and right-maximal.
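The definitions above can be checked on small inputs with a brute-force sketch such as the following. It is only an illustration of the definitions, not a method from the paper: it assumes that the text is passed with a unique separator `#` already appended, it uses characters instead of integers, and all function names are hypothetical.

```python
def freq(text, w):
    """Number of (possibly overlapping) occurrences of w in the circular text."""
    if len(w) > len(text):
        return 0
    doubled = text + text  # lets an occurrence wrap around the end of the text
    return sum(1 for i in range(len(text)) if doubled.startswith(w, i))


def left_extensions(text, w):
    """Characters a such that a + w occurs in the circular text."""
    return {a for a in set(text) if freq(text, a + w) > 0}


def right_extensions(text, w):
    """Characters b such that w + b occurs in the circular text."""
    return {b for b in set(text) if freq(text, w + b) > 0}


def is_maximal_repeat(text, w):
    """A repeat that is both left-maximal and right-maximal."""
    return (freq(text, w) > 1
            and len(left_extensions(text, w)) > 1
            and len(right_extensions(text, w)) > 1)


def maximal_repeats(text):
    """All non-empty maximal repeats, by exhaustive enumeration.

    Substrings that wrap around contain the unique separator, so they occur
    once and cannot be repeats: non-wrapping candidates are enough.
    """
    n = len(text)
    candidates = {text[i:j] for i in range(n) for j in range(i + 1, n + 1)}
    return {w for w in candidates if is_maximal_repeat(text, w)}
```

For example, on the text `banana#` the string `ana` occurs twice, is preceded by `b` and `n`, and is followed by `n` and `#`, so it is a maximal repeat, whereas `na` is always preceded by `a` and is therefore not left-maximal.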

The unidirectional de Bruijn graph of order $k$ of $T$ is a directed graph $(V,E)$ whose node set $V$ is in one-to-one correspondence with the set of distinct $k$ -mers that occur in $T$ ; there is an arc $(v,w)\in E$ for every distinct $(k+1)$ -mer $W$ such that both $W[1..k]$ and $W[2..k+1]$ occur in $T$ , and such arc is labelled with character $W[k+1]$ . In some formulations, $E$ contains just those arcs that correspond to $(k+1)$ -mers that occur in $T$ : in this case, a $k$ -mer is right-maximal (respectively, left-maximal) in $T$ iff its corresponding node in $V$ has at least two outgoing (respectively, incoming) arcs. The bidirectional de Bruijn graph is defined symmetrically.
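As a concrete illustration of the second formulation (arcs only for $(k+1)$ -mers that occur in $T$ ), the following sketch builds the graph explicitly. It is a naive construction under simplifying assumptions: occurrences are counted in the linear text rather than in its circular version, and the function name `de_bruijn` is hypothetical.

```python
from collections import defaultdict


def de_bruijn(text, k):
    """Return (nodes, arcs) of the order-k de Bruijn graph of a linear text.

    nodes: set of distinct k-mers.
    arcs:  dict mapping a k-mer to the set of (label, destination k-mer)
           pairs, one for every distinct (k+1)-mer that occurs in the text.
    """
    nodes = {text[i:i + k] for i in range(len(text) - k + 1)}
    arcs = defaultdict(set)
    for i in range(len(text) - k):
        w = text[i:i + k + 1]            # a (k+1)-mer W
        arcs[w[:k]].add((w[k], w[1:]))   # arc W[1..k] -> W[2..k+1], label W[k+1]
    return nodes, dict(arcs)
```

On the text `abcabd` with $k=2$ , node `ab` has two outgoing arcs (labelled `c` and `d`), which matches the fact that `ab` is right-maximal.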

We denote by $\mathsf{ST}_{T}$ the suffix tree of $T\#$ , and by $\overline{\mathsf{ST}}_{T}$ the suffix tree of $\overline{T}\#$ . We assume the reader to be already familiar with the basics of suffix trees, including suffix links, which we do not further describe here. We denote by $\ell(v)$ the label of a node $v$ of a suffix tree, and we say that $v$ is the locus of all substrings $W[1..k]$ of $T$ where $|\ell(u)|<k\leq|\ell(v)|$ , $u$ is the parent of $v$ , and $W=\ell(v)$ . It is well-known that a substring $W$ of $T$ is right-maximal (respectively, left-maximal) iff $W=\ell(v)$ for some internal node $v$ of $\mathsf{ST}_{T}$ (respectively, for some internal node $v$ of $\overline{\mathsf{ST}}_{T}$ ). Suffix links and internal nodes of $\mathsf{ST}_{T}$ form a tree, called the suffix-link tree of $T$ and denoted by $\mathsf{SLT}_{T}$ , and inverting the direction of all suffix links yields the so-called explicit Weiner links. Given an internal node $v$ and a character $a\in[0..\sigma]$ , it might happen that string $a\ell(v)$ occurs in $T$ but is not right-maximal, i.e. it is not the label of any internal node of $\mathsf{ST}_{T}$ : all such left extensions of internal nodes that end in the middle of an edge are called implicit Weiner links. An internal node $v$ of $\mathsf{ST}_{T}$ can have more than one outgoing Weiner link, and all such Weiner links have distinct labels: in this case, $\ell(v)$ is a maximal repeat, as well as the label of a node in $\overline{\mathsf{ST}}_{T}$ . Maximal repeats and implicit Weiner links are related by the following simple property, which was already hinted at in [2]:

Property 1.

Let $v$ be an internal node of $\mathsf{ST}_{T}$ . If there is an implicit Weiner link from $v$ , then $\ell(v)$ is a maximal repeat of $T$ .

It is known that the number of suffix links (or, equivalently, of explicit Weiner links) is upper-bounded by $2n-2$ , and that the number of implicit Weiner links can be upper-bounded by $2n-2$ as well. We call $\mathsf{SLT}^{*}_{T}$ a version of $\mathsf{SLT}_{T}$ augmented with implicit Weiner links and with nodes corresponding to their destinations. We say that a maximal repeat $W$ of $T$ is rightmost if no string $WV$ with $V\in[0..\sigma]^{+}$ is left-maximal in $T$ . Symmetrically, we say that a maximal repeat $W$ of $T$ is leftmost if no string $VW$ with $V\in[0..\sigma]^{+}$ is right-maximal in $T$ . Since left-maximality is closed under prefix operation, it is easy to see that the maximal repeats of $T$ are all and only the nodes of $\mathsf{ST}_{T}$ that lie on paths that start from the root and that end at nodes labelled by rightmost maximal repeats. We call this the maximal repeat subgraph of $\mathsf{ST}_{T}$ (Figure 2b). Clearly the maximal repeats of $T$ coincide with the branching nodes of $\overline{\mathsf{SLT}}^{*}_{T}$ (Figure 2a), and the rightmost maximal repeats of $T$ coincide with the leaves of $\overline{\mathsf{SLT}}_{T}$ . Thus, it is easy to see that $\overline{\mathsf{SLT}}_{T}$ (a trie) is a subdivision of the maximal repeat subgraph of $\mathsf{ST}_{T}$ (a compact trie), and that the nodes in the unary paths of $\overline{\mathsf{SLT}}_{T}$ are in one-to-one correspondence with the internal nodes of $\overline{\mathsf{ST}}_{T}$ that are not maximal repeats (see Figures 2a and 2b for an example, and see Section 2.1 in [7] for an extended explanation). The following property is thus immediate (and symmetrical notions hold for $\overline{\mathsf{ST}}_{T}$ , $\mathsf{SLT}^{*}_{T}$ , and leftmost maximal repeats):

Property 2.

Let $v$ be an internal node of $\mathsf{ST}_{T}$ . The locus $w$ of $\overline{\ell(v)}$ in $\overline{\mathsf{ST}}_{T}$ is such that $\ell(w)$ is the reverse of a maximal repeat of $T$ .

The compact directed acyclic word graph of a string $T$ (denoted by $\mathsf{CDAWG}_{T}$ in what follows) is the minimal compact automaton that recognizes the suffixes of $T$ [16, 20]. We denote by $\overline{\mathsf{CDAWG}}_{T}$ the CDAWG of the reverse of $T$ , by $e_{T}$ the number of arcs in $\mathsf{CDAWG}_{T}$ , and by $\overline{e}_{T}$ the number of arcs in $\overline{\mathsf{CDAWG}}_{T}$ . The CDAWG of $T$ can be seen as the minimization of $\mathsf{ST}_{T}$ , in which all leaves are merged to the same node (the sink, that represents $T$ itself), and in which all nodes except the sink are in one-to-one correspondence with the maximal repeats of $T$ [44]. Every arc of $\mathsf{CDAWG}_{T}$ is labeled by a substring of $T$ , and the out-neighbors $w_{1},\dots,w_{k}$ of every node $v$ of $\mathsf{CDAWG}_{T}$ are sorted according to the lexicographic order of the distinct labels of arcs $(v,w_{1}),\dots,(v,w_{k})$ . Since there is a bijection between the nodes of $\mathsf{CDAWG}_{T}$ and the maximal repeats of $T$ , the node $v^{\prime}$ of $\mathsf{CDAWG}_{T}$ with $\ell(v^{\prime})=W$ is the equivalence class of the nodes $\{v_{1},\dots,v_{k}\}$ of $\mathsf{ST}_{T}$ such that $\ell(v_{i})=W[i..|W|]$ for all $i\in[1..k]$ , and such that $v_{k},v_{k-1},\dots,v_{1}$ is a maximal unary path of explicit Weiner links. The subtrees of $\mathsf{ST}_{T}$ rooted at all such nodes are isomorphic. It follows that a right-maximal string can be identified by the maximal repeat $W$ it belongs to, and by the length of the corresponding suffix of $W$ (see [8] for an extended explanation).

We assume the reader to be familiar with the Burrows-Wheeler transform of $T$ , which we denote by $\mathsf{BWT}_{T}$ (we use $\overline{\mathsf{BWT}}_{T}$ to denote the BWT of the reverse of $T$ ) and which we do not describe further here. We say that $\mathsf{BWT}_{T}[i..j]$ is a run iff: (1) $\mathsf{BWT}_{T}[k]=c\in[0..\sigma]$ for all $k\in[i..j]$ ; (2) every substring $\mathsf{BWT}_{T}[i^{\prime}..j^{\prime}]$ such that $i^{\prime}\leq i$ , $j^{\prime}\geq j$ , and $[i^{\prime}..j^{\prime}]\neq[i..j]$ , contains at least two distinct characters. We denote by $\mathcal{R}_{T}$ the set of all triplets $(c,i,j)$ such that $\mathsf{BWT}_{T}[i..j]$ is a run of character $c$ , and we use $\overline{\mathcal{R}}_{T}$ to denote the set of runs of $\overline{\mathsf{BWT}}_{T}$ . It is known that $|\mathcal{R}_{T}|$ is at most equal to the number of arcs in $\mathsf{CDAWG}_{T}$ [10].
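The run set $\mathcal{R}_{T}$ can be illustrated with a small quadratic-time sketch. It assumes that the input already ends with a separator `#` that is smaller than every other character, it sorts suffixes naively instead of using a linear-time construction, and the function names are hypothetical.

```python
from itertools import groupby


def bwt(text):
    """BWT of a text that ends with a unique, smallest separator '#'."""
    order = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix array
    return "".join(text[i - 1] for i in order)  # character preceding each suffix


def runs(transform):
    """Triplets (c, i, j), 1-based and inclusive, one per maximal run of c."""
    result, start = [], 1
    for c, group in groupby(transform):
        length = len(list(group))
        result.append((c, start, start + length - 1))
        start += length
    return result
```

For `banana#` the transform is `annb#aa`, which has five runs even though it has seven characters.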

Given a second string $S\in[1..\sigma]^{+}$ , the matching statistics array $\mathsf{MS}_{S,T}$ of $S$ with respect to $T$ is an array of length $|S|$ such that $\mathsf{MS}_{S,T}[i]$ is the largest $j$ such that $S[i..i+j-1]$ occurs in $T$ .
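A direct, quadratic-time reading of this definition is sketched below. It assumes occurrences in the linear text $T$ and 0-based positions, and the function name is hypothetical; efficient algorithms compute the same array with an index on $T$ .

```python
def matching_statistics(s, t):
    """ms[i] is the length of the longest prefix of s[i:] that occurs in t."""
    ms = []
    for i in range(len(s)):
        j = 0
        # Grow the match one character at a time while it still occurs in t.
        while i + j < len(s) and s[i:i + j + 1] in t:
            j += 1
        ms.append(j)
    return ms
```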

In the rest of the paper we drop subscripts whenever they are clear from the context.

2.2. String indexes

A bidirectional index is a data structure that, given a constant-space descriptor $\mathtt{id}(W)$ of a substring $W$ of $T$ , supports the following operations: $\mathtt{extendRight}(\mathtt{id}(W),a)=\mathtt{id}(Wa)$ if $f(Wa)>0$ , or an error otherwise; $\mathtt{enumerateRight}(\mathtt{id}(W))=\{\mathtt{id}(Wa):a\in\Sigma,f(Wa)>0\}$ ; $\mathtt{isRightMaximal}(\mathtt{id}(W))=\mathtt{true}$ iff $|\mathtt{enumerateRight}(\mathtt{id}(W))|>1$ . Operations $\mathtt{extendLeft}$ , $\mathtt{enumerateLeft}$ and $\mathtt{isLeftMaximal}$ are defined symmetrically. We consider bidirectional indexes based on the BWT: specifically, we denote with $\mathbb{I}(W,T)$ the function that maps a substring $W$ of $T$ to the interval of $W$ in $\mathsf{BWT}$ , i.e. to the interval of all suffixes of $T\#$ that start with $W$ , and we use $\mathtt{id}(W)=(\mathbb{I}(W,T),\mathbb{I}(\overline{W},\overline{T}),|W|)$ as a constant-space descriptor of $W$ . A number of bidirectional BWT indexes have been described in the literature; in this paper we are just interested in the data structure from [11], which supports all operations in linear time in the size of their output, takes $O(|T|\log{\sigma})$ bits of space, and can be built in randomized $O(|T|)$ time and $O(|T|\log{\sigma})$ bits of working space.
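The descriptor $\mathtt{id}(W)$ and the semantics of the operations can be made concrete with a brute-force reference model. This is only a sketch for testing intuition on small strings: it recomputes intervals by scanning sorted suffixes, so it has none of the time or space guarantees of the index from [11]. The class name, the method names, the 0-based inclusive intervals, and the use of `KeyError` as the error are all assumptions, and `#` is assumed not to occur in $T$ and to be smaller than every character of $T$ .

```python
class NaiveBidirectionalIndex:
    """Brute-force model of id(W) = (I(W, T), I(reverse W, reverse T), |W|)."""

    def __init__(self, text):
        self.alphabet = sorted(set(text))
        self.fwd = self._sorted_suffixes(text + "#")
        self.rev = self._sorted_suffixes(text[::-1] + "#")

    @staticmethod
    def _sorted_suffixes(text):
        return sorted(text[i:] for i in range(len(text)))

    @staticmethod
    def _interval(suffixes, w):
        # Suffixes that start with w are contiguous in sorted order.
        ranks = [r for r, s in enumerate(suffixes) if s.startswith(w)]
        return (ranks[0], ranks[-1]) if ranks else None

    def id(self, w):
        fwd = self._interval(self.fwd, w)
        if fwd is None:
            raise KeyError(w)  # w does not occur in the text
        return (fwd, self._interval(self.rev, w[::-1]), len(w))

    def _string(self, desc):
        # Recover W from its descriptor; a real index never materialises W.
        (lo, _), _, length = desc
        return self.fwd[lo][:length]

    def extend_right(self, desc, a):
        return self.id(self._string(desc) + a)

    def extend_left(self, desc, a):
        return self.id(a + self._string(desc))

    def enumerate_right(self, desc):
        w = self._string(desc)
        return [self.id(w + a) for a in self.alphabet
                if self._interval(self.fwd, w + a) is not None]

    def is_right_maximal(self, desc):
        return len(self.enumerate_right(desc)) > 1
```

For the text `banana`, the descriptor of `an` is `((2, 3), (5, 6), 2)`: the two intervals have the same size because both count the occurrences of `an`.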

Given a string $T\in[1..\sigma]^{n-1}\#$ , we call run-length encoded BWT ( $\mathsf{RLBWT}_{T}$ ) any representation of $\mathsf{BWT}_{T}$ that takes $O(|\mathcal{R}_{T}|)$ words of space and supports the well-known rank and select operations (see e.g. [31, 32, 48]). It is easy to implement a version of $\mathsf{RLBWT}_{T}$ that supports rank and select in $O(\log{\log{n}})$ time [10]. In this paper we use the representation of the suffix tree based on the CDAWG described in [8], which takes just $O(e+\overline{e})$ words of space by augmenting $\mathsf{CDAWG}$ and $\overline{\mathsf{CDAWG}}$ with the RLBWT of $T$ and $\overline{T}$ . Such a data structure describes a node $v$ of $\mathsf{ST}$ as a tuple $\mathtt{id}(v)=(v^{\prime},|\ell(v)|,i,j)$ , where $v^{\prime}$ is the node in $\mathsf{CDAWG}$ that corresponds to the equivalence class of $v$ , and $[i..j]$ is the interval of $\ell(v)$ in $\mathsf{BWT}$ . For every node $v$ of $\mathsf{CDAWG}$ , the index stores, among other things: $|\ell(v)|$ in a variable $v\mathtt{.length}$ ; the number $v\mathtt{.size}$ of right-maximal strings that belong to its equivalence class; and the interval $[v\mathtt{.first}..v\mathtt{.last}]$ of $\ell(v)$ in $\mathsf{BWT}_{T}$ . For every arc $\gamma=(v,w)$ of $\mathsf{CDAWG}$ , the index stores the first character of $\ell(\gamma)$ in a variable $\gamma\mathtt{.char}$ , and the number of characters of the right extension implied by $\gamma$ in a variable $\gamma\mathtt{.right}$ . Finally, we add to the CDAWG all arcs $(v,w,c)$ such that $w$ is the equivalence class of the destination of a Weiner link from $v$ labeled by character $c$ in $\mathsf{ST}_{T}$ , as well as the reverse of all explicit Weiner link arcs. See [8] for an extended description of the data structure and of the complexity of its operations. 
Here we just mention that the index supports operations $\mathtt{stringDepth}(\mathtt{id}(v))$ and $\mathtt{child}(\mathtt{id}(v),c)$ in constant time, and $\mathtt{parent}(\mathtt{id}(v))$ , $\mathtt{suffixLink}(\mathtt{id}(v))$ , $\mathtt{weinerLink}(\mathtt{id}(v),c)$ in $O(\log{\log{|T|}})$ time.

In this paper we need to store the topology of $\overline{\mathsf{SLT}}$ and the topology of $\mathsf{ST}$ efficiently. It is well-known that the topology of an ordered tree of $n$ nodes can be represented using $2n+o(n)$ bits, as a sequence of $2n$ balanced parentheses [39]. Let $\mathtt{id}(v)$ be the rank of a node $v$ in the preorder traversal of the tree. Given the balanced parentheses representation of the tree encoded in $2n+o(n)$ bits, it is also well-known that one can build a data structure that takes $2n+o(n)$ bits, and that supports several common operations in constant time [40, 41, 46], among which: $\mathtt{parent}(\mathtt{id}(v))$ , which returns $\mathtt{id}(u)$ , where $u$ is the parent of $v$ , or an error if $v$ is the root; $\mathtt{lca}(\mathtt{id}(v),\mathtt{id}(w))$ , which returns $\mathtt{id}(u)$ , where $u$ is the lowest common ancestor of nodes $v$ and $w$ ; $\mathtt{leftmostLeaf}(\mathtt{id}(v))$ and $\mathtt{rightmostLeaf}(\mathtt{id}(v))$ , which return one plus the number of leaves that, in the preorder traversal of the tree, are visited before the first (respectively, the last) leaf that belongs to the subtree rooted at $v$ ; $\mathtt{depth}(\mathtt{id}(v))$ , which returns the distance of $v$ from the root. This data structure can be built in $O(n)$ time and in $O(n)$ bits of working space. Moreover, given a node $v$ and a length $d$ , a level-ancestor query asks for the ancestor $u$ of $v$ such that the path from the root to $u$ contains exactly $d$ nodes. The level ancestor data structure described in [14, 15] takes $O(n)$ words of space and answers queries in constant time. Assuming that some nodes of the tree are marked, a lowest marked ancestor data structure allows one to move in constant time from any node, to its lowest ancestor that is marked [33].
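The balanced parentheses encoding and two of the listed operations can be sketched as follows. The sketch answers queries by scanning the sequence, so it does not achieve the constant-time bounds of the succinct structures cited above. The input format (a dictionary from a node to its ordered list of children) and the function names are assumptions, and nodes are identified by their 1-based preorder rank.

```python
def balanced_parentheses(children, root=0):
    """Encode an ordered tree as a sequence of 2n balanced parentheses."""
    out = []

    def visit(v):
        out.append("(")
        for c in children.get(v, []):
            visit(c)
        out.append(")")

    visit(root)
    return "".join(out)


def open_position(bp, rank):
    """Position of the opening parenthesis of the node with preorder rank."""
    seen = 0
    for p, ch in enumerate(bp):
        if ch == "(":
            seen += 1
            if seen == rank:
                return p
    raise IndexError(rank)


def depth(bp, rank):
    """Distance from the root: parentheses still open before the node."""
    p = open_position(bp, rank)
    return bp[:p].count("(") - bp[:p].count(")")


def parent(bp, rank):
    """Preorder rank of the parent, found as the nearest enclosing '('."""
    p = open_position(bp, rank)
    balance = 0
    for q in range(p - 1, -1, -1):
        if bp[q] == ")":
            balance += 1
        elif balance > 0:
            balance -= 1
        else:
            return bp[:q + 1].count("(")
    raise ValueError("the root has no parent")
```

For a root with two children, the first of which has two leaf children, the encoding is `((()())())`.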

We use the tree data structures described above to store the topology of $\mathsf{ST}$ and of $\overline{\mathsf{SLT}}$ . Moreover, we mark in two bitvectors the nodes of $\overline{\mathsf{SLT}}$ and of $\mathsf{ST}$ that are maximal repeats (in preorder), and we index such bitvectors to support constant-time rank and select queries. Since $\overline{\mathsf{SLT}}$ is a subdivision of the subgraph of $\mathsf{ST}$ induced by maximal repeats, the $i$ -th one in each of the two bitvectors corresponds to the same maximal repeat. Thus, if node $v$ is a maximal repeat, and if we know its preorder position in $\mathsf{ST}$ , we can compute the length of $\ell(v)$ by moving to the corresponding node $v^{\prime}$ in $\overline{\mathsf{SLT}}$ and by computing the depth of $v^{\prime}$ in the topology of $\overline{\mathsf{SLT}}$ (see [7] for an extended explanation).

The rest of the paper focuses on representations of variable-order, bidirectional de Bruijn graphs that support the following primitives (for brevity we list here just operations in one direction). Let $k$ be the current order of the de Bruijn graph. Operation $v=\mathtt{node}(W)$ , called membership, returns the identifier of the node associated with $k$ -mer $W$ , or an error if $W$ does not occur in $T$ . Operation $C=\mathtt{arcLabels}(v)$ returns the set of characters $C$ that label all arcs from node $v$ in the right direction, and operation $\mathtt{degree}(v)$ returns the number of such arcs. Query $e=\mathtt{arc}(v,c)$ returns the identifier of the arc that corresponds to string $\ell(v)\cdot c$ , if any, where $v$ is a node in the current de Bruijn graph, $\ell(v)$ is the $k$ -mer that corresponds to node $v$ , and $c$ is a character; it returns an error if no such arc exists. Operation $w=\mathtt{followArc}(v,c)$ is similar, but returns the identifier of the node $w$ reached by the arc, if any. Queries $\mathtt{freq}(v)$ and $\mathtt{freq}(e)$ return the number of occurrences of the $k$ -mer associated with node $v$ and of the $(k+1)$ -mer associated with arc $e$ (the number of occurrences of an arc might be zero). Representations that support such queries are called frequency-aware or weighted (see e.g. [42]). Operation $v^{\prime}=\mathtt{increaseK}(v,c)$ for $c\in[0..\sigma]$ returns the node $v^{\prime}$ associated with string $\ell(v)\cdot c$ in the de Bruijn graph of order $k+1$ , if any, or an error otherwise. Operation $v^{\prime}=\mathtt{decreaseK}(v)$ returns the node $v^{\prime}$ associated with the prefix of length $k-1$ of $\ell(v)$ in the de Bruijn graph of order $k-1$ .

In addition to increasing and decreasing the order by one unit, some variable-order representations allow the user to specify the desired amount of change [13, 17]. In the rest of the paper we argue that it is more natural to change the order based on the frequency or on the extensions of $k$ -mers, as proposed in [22]. Specifically, given a node $v$ of the current de Bruijn graph, let $\ell(v)\cdot W$ , $W\in\Sigma^{*}$ , be the longest string with the same frequency as $\ell(v)$ in $T$ . Operation $(v^{\prime},k^{\prime})=\mathtt{increaseK}(v)$ returns the node $v^{\prime}$ associated with $\ell(v)\cdot W$ in the de Bruijn graph of order $k+|W|$ , and sets $k^{\prime}$ to the new order $k+|W|$ . Given a node $v$ of the current de Bruijn graph, let $W$ be the longest prefix of $\ell(v)$ that has a different frequency from $\ell(v)$ in $T$ . Operation $(v^{\prime},k^{\prime})=\mathtt{decreaseK}(v)$ returns the node $v^{\prime}$ associated with $W$ in the de Bruijn graph of order $|W|$ , and sets $k^{\prime}$ to $|W|$ . Alternatively, one might want $W$ to be the longest prefix of $\ell(v)$ such that the left-extensions of $W$ are a superset of the left-extensions of $\ell(v)$ . A de Bruijn graph that supports such operations without returning the value of the new order is called hidden-order [22].
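The frequency-based versions of $\mathtt{increaseK}$ and $\mathtt{decreaseK}$ can be illustrated on explicit strings with the brute-force sketch below. It is not the index-based implementation discussed in the rest of the paper: it assumes occurrences in the linear text, it represents a node directly by its $k$ -mer, it counts the empty string as occurring $|T|+1$ times, and the function names are hypothetical.

```python
def freq(text, w):
    """Number of (possibly overlapping) occurrences of w in the linear text."""
    return sum(1 for i in range(len(text) - len(w) + 1) if text.startswith(w, i))


def increase_k(text, kmer):
    """Longest right extension of kmer that has the same frequency as kmer."""
    target = freq(text, kmer)
    current = kmer
    while True:
        # At most one character can preserve the frequency of a string that occurs.
        same = [c for c in sorted(set(text)) if freq(text, current + c) == target]
        if not same:
            return current, len(current)
        current += same[0]


def decrease_k(text, kmer):
    """Longest prefix of kmer whose frequency differs from the one of kmer."""
    target = freq(text, kmer)
    for length in range(len(kmer) - 1, -1, -1):
        if freq(text, kmer[:length]) != target:
            return kmer[:length], length
    raise ValueError("no prefix with a different frequency")
```

In `banana`, increasing the order from node `an` stops at `ana` (both occur twice, while `anan` occurs once), and decreasing the order from `ana` skips `an` and stops at `a`, which occurs three times.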

2.3. String indexes

A bidirectional index is a data structure that, given a constant-space descriptor $\mathtt{id}(W)$ of a substring $W$ of $T$ , supports the following operations: $\mathtt{extendRight}(\mathtt{id}(W),a)=\mathtt{id}(Wa)$ if $f(Wa)>0$ , or error otherwise; $\mathtt{enumerateRight}(\mathtt{id}(W))=\{\mathtt{id}(Wa):a\in\Sigma,f(Wa)>0\}$ ; $\mathtt{isRightMaximal}(\mathtt{id}(W))=\mathtt{true}$ iff $|\mathtt{enumerateRight}(\mathtt{id}(W))|>1$ . Operations $\mathtt{extendLeft}$ , $\mathtt{enumerateLeft}$ and $\mathtt{isLeftMaximal}$ are defined symmetrically. Here we consider bidirectional indexes based on the BWT: specifically, we denote with $\mathbb{I}(W,T)$ the function that maps a substring $W$ of $T$ to the interval of $W$ in $\mathsf{BWT}$ , i.e. to the interval of all suffixes of $T\#$ that start with $W$ , and we use $\mathtt{id}(W)=(\mathbb{I}(W,T),\mathbb{I}(\overline{W},\overline{T}),|W|)$ as a constant-space descriptor of a substring $W$ . A number of bidirectional BWT indexes have been described in the literature: here we are interested just in the data structure described in [11], which supports all operations in linear time in the size of their output, takes $O(|T|\log{\sigma})$ bits of space, and can be built in randomized $O(|T|)$ time and $O(|T|\log{\sigma})$ bits of working space. See [11] for more details.

Given a string $T\in[1..\sigma]^{n-1}\#$ , we call run-length encoded BWT ( $\mathsf{RLBWT}_{T}$ ) any representation of $\mathsf{BWT}_{T}$ that takes $O(|\mathcal{R}_{T}|)$ words of space, and that supports the well known rank and select operations: see for example [31, 32, 48]. It is easy to implement a version of $\mathsf{RLBWT}_{T}$ that supports rank in $O(\log{\log{n}})$ time and select in $O(\log{\log{n}})$ time [10]. In this paper we use the representation of $\mathsf{ST}$ based on $\mathsf{CDAWG}$ described in [8], which takes just $O(e+\overline{e})$ words of space by augmenting $\mathsf{CDAWG}$ and $\overline{\mathsf{CDAWG}}$ with the RLBWT of $T$ and of $\overline{T}$ . Such data structure represents a node $v$ of $\mathsf{ST}$ as a tuple $\mathtt{id}(v)=(v^{\prime},|\ell(v)|,i,j)$ , where $v^{\prime}$ is the node in $\mathsf{CDAWG}$ that corresponds to the equivalence class of $v$ , and $[i..j]$ is the interval of $\ell(v)$ in $\mathsf{BWT}$ . For every node $v$ of $\mathsf{CDAWG}$ , the index stores, among other things: $|\ell(v)|$ in a variable $v\mathtt{.length}$ ; the number $v\mathtt{.size}$ of right-maximal strings that belong to its equivalence class; and the interval $[v\mathtt{.first}..v\mathtt{.last}]$ of $\ell(v)$ in $\mathsf{BWT}_{T}$ . For every arc $\gamma=(v,w)$ of $\mathsf{CDAWG}$ , the index stores the first character of $\ell(\gamma)$ in a variable $\gamma\mathtt{.char}$ , and the number of characters of the right extension implied by $\gamma$ in a variable $\gamma\mathtt{.right}$ . Finally, we add to the CDAWG all arcs $(v,w,c)$ such that $w$ is the equivalence class of the destination of a Weiner link from $v$ labeled by character $c$ in $\mathsf{ST}_{T}$ , and the reverse of all explicit Weiner link arcs. See [8] for a full description of the data structure and of the complexity of its operations. 
Here we just mention that the index supports operations $\mathtt{stringDepth}(\mathtt{id}(v))$ and $\mathtt{child}(\mathtt{id}(v),c)$ in constant time, and $\mathtt{parent}(\mathtt{id}(v))$ , $\mathtt{suffixLink}(\mathtt{id}(v))$ , $\mathtt{weinerLink}(\mathtt{id}(v),c)$ in $O(\log{\log{|T|}})$ time. It also allows reading the character at position $i$ of $T$ in $O(\log{|T|})$ time.

Finally, in this paper we need to store the topology of $\overline{\mathsf{SLT}}$ and the topology of $\mathsf{ST}$ efficiently. It is well known that the topology of an ordered tree of $n$ nodes can be represented using $2n+o(n)$ bits, as a sequence of $2n$ balanced parentheses built by opening a parenthesis, by recurring on every child of the current node in order, and by closing a parenthesis [39]. Let $\mathtt{id}(v)$ be the rank of a node $v$ in the preorder traversal of the tree. Given the balanced parentheses representation of the tree encoded in $2n+o(n)$ bits, it is also well known that one can build a data structure that takes $2n+o(n)$ bits, and that supports several common operations in constant time [40, 46, 41], among which: $\mathtt{parent}(\mathtt{id}(v))$ , which returns $\mathtt{id}(u)$ , where $u$ is the parent of $v$ , or an error if $v$ is the root; $\mathtt{lca}(\mathtt{id}(v),\mathtt{id}(w))$ , which returns $\mathtt{id}(u)$ , where $u$ is the lowest common ancestor of nodes $v$ and $w$ ; $\mathtt{leftmostLeaf}(\mathtt{id}(v))$ and $\mathtt{rightmostLeaf}(\mathtt{id}(v))$ , which return one plus the number of leaves that, in the preorder traversal of the tree, are visited before the first (respectively, the last) leaf that belongs to the subtree rooted at $v$ ; $\mathtt{selectLeaf}(i)$ , which returns $\mathtt{id}(v)$ , where $v$ is the $i$ -th leaf in preorder; $\mathtt{depth}(\mathtt{id}(v))$ , which returns the distance of $v$ from the root. This data structure can be built in $O(n)$ time and in $O(n)$ bits of working space. Moreover, given a node $v$ and a length $d$ , a level-ancestor query asks for the ancestor $u$ of $v$ such that the path from the root to $u$ contains exactly $d$ nodes. The level ancestor data structure described in [14, 15] takes $O(n)$ words of space and it answers queries in constant time. 
Assuming that some nodes of the tree are marked, a lowest marked ancestor data structure [33] allows one to move in constant time from any node, to its lowest ancestor that is marked.
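The balanced-parentheses encoding described above can be illustrated with a minimal sketch. It assumes the tree is given as a dictionary from each node to its ordered list of children; the names `balanced_parentheses` and `parent_and_depth` are hypothetical, preorder ranks are 0-based here, and the linear scan only stands in for the constant-time succinct structures cited in the text.

```python
from typing import Dict, List, Optional, Tuple


def balanced_parentheses(children: Dict[int, List[int]], root: int) -> str:
    """Encode an ordered tree of n nodes as a string of 2n parentheses."""
    out: List[str] = []
    # Iterative DFS: (node, False) opens a node, (node, True) closes it.
    stack: List[Tuple[int, bool]] = [(root, False)]
    while stack:
        node, closing = stack.pop()
        if closing:
            out.append(")")
            continue
        out.append("(")
        stack.append((node, True))
        # Push children in reverse so they are visited left to right.
        for child in reversed(children.get(node, [])):
            stack.append((child, False))
    return "".join(out)


def parent_and_depth(bp: str) -> Tuple[List[Optional[int]], List[int]]:
    """For every 0-based preorder rank, return the parent rank and the depth.

    This is a plain O(n) scan, not the constant-time index of the paper.
    """
    parent: List[Optional[int]] = []
    depth: List[int] = []
    open_nodes: List[int] = []  # preorder ranks of the currently open ancestors
    for symbol in bp:
        if symbol == "(":
            parent.append(open_nodes[-1] if open_nodes else None)
            depth.append(len(open_nodes))
            open_nodes.append(len(parent) - 1)
        else:
            open_nodes.pop()
    return parent, depth
```

For a root with two children, the first of which has two leaf children, `balanced_parentheses({0: [1, 4], 1: [2, 3]}, 0)` yields `"((()())())"`, which has $2n=10$ symbols for $n=5$ nodes.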

We use the tree data structures described above to store the topology of $\mathsf{ST}$ and of $\overline{\mathsf{SLT}}$ . Moreover, we mark in a bitvector the nodes of $\overline{\mathsf{SLT}}$ and of $\mathsf{ST}$ that are maximal repeats (in preorder), and we index such bitvectors to support constant-time rank and select queries. Since $\overline{\mathsf{SLT}}$ is a subdivision of the subgraph of $\mathsf{ST}$ induced by maximal repeats, the $i$ -th one in the two bitvectors correspond to the same maximal repeat. Thus, if node $v$ is a maximal repeat and if we know its position in preorder in $\mathsf{ST}$ , it is easy to see that we can compute the length of $\ell(v)$ by going to the node $v^{\prime}$ in $\overline{\mathsf{SLT}}$ and by computing the depth of $v^{\prime}$ in the topology of $\overline{\mathsf{SLT}}$ : see [7] for a more thorough explanation.
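Commuting between the two topologies through the marking bitvectors can be sketched as follows. This is only an illustration under stated assumptions: the class `MarkedNodes` and the function `commute` are hypothetical names, plain Python lists stand in for succinct rank/select structures, positions are 0-based preorder ranks, and the toy bitvectors in the usage note are not derived from any real string.

```python
from typing import List


class MarkedNodes:
    """Bitvector over preorder ranks; a one marks a maximal repeat."""

    def __init__(self, bits: List[int]) -> None:
        self.bits = bits
        # prefix[i] = number of ones in bits[0..i-1]
        self.prefix = [0]
        for b in bits:
            self.prefix.append(self.prefix[-1] + b)
        # ones[k-1] = position of the k-th one
        self.ones = [i for i, b in enumerate(bits) if b]

    def rank(self, i: int) -> int:
        """Number of ones in bits[0..i], inclusive."""
        return self.prefix[i + 1]

    def select(self, k: int) -> int:
        """Position of the k-th one, for k >= 1."""
        return self.ones[k - 1]


def commute(source: MarkedNodes, target: MarkedNodes, node: int) -> int:
    """Map a marked node of one tree to the same maximal repeat in the other.

    Relies on the i-th one of both bitvectors denoting the same maximal repeat.
    """
    if not source.bits[node]:
        raise ValueError("node is not marked as a maximal repeat")
    return target.select(source.rank(node))
```

With `st = MarkedNodes([1, 0, 1, 0, 0, 1])` and `slt = MarkedNodes([1, 0, 0, 1, 1])`, the call `commute(st, slt, 2)` returns `3`: node 2 is the second marked node in the first tree, and the second marked node of the other tree sits at position 3.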

3. Contracting in constant time

As mentioned, existing bidirectional BWT indexes support left-contraction just from right-maximal substrings (and symmetrically, they support right-contraction just from left-maximal substrings). Specifically, if the substring $aW$ is right-maximal and labels a node $v$ of $\mathsf{ST}$ , then $\mathbb{I}(W,T)$ is the interval of node $\mathtt{suffixLink}(v)$ in $\mathsf{ST}$ , and since we are removing one character from the right of $\overline{aW}$ , the locus of $\overline{W}$ in $\overline{\mathsf{ST}}$ is either the same as the locus $w$ of $\overline{aW}$ , or it is $\mathtt{parent}(w)$ , whichever has the same frequency as $\mathbb{I}(W,T)$ [11, 38].

To support left-contraction from a substring that is not right-maximal, it is enough to have access to the topology of $\overline{\mathsf{SLT}}$ :

Theorem 1.

Let $T$ be a string on alphabet $\Sigma$ . There is a data structure that supports operations $\mathtt{extendRight}$ , $\mathtt{extendLeft}$ , $\mathtt{contractRight}$ and $\mathtt{contractLeft}$ in constant time and in $O(n\log\sigma)$ bits of space. Such a data structure can be built in randomized $O(n)$ time and $O(n\log\sigma)$ bits of working space.

Proof.

We use the data structures described in [11], augmented with the topology of $\mathsf{SLT}$ and with a bitvector to commute between the topology of $\mathsf{ST}$ and the topology of $\mathsf{SLT}$ (see [7] for details on commuting). Such data structures take $O(n\log\sigma)$ bits of space, and they can be built in randomized $O(n)$ time using the algorithms in [4, 12]. They support operations $\mathtt{extendRight}(\mathtt{id}(W),a)=\mathtt{id}(Wa)$ and $\mathtt{extendLeft}(\mathtt{id}(W),a)=\mathtt{id}(aW)$ , where $\mathtt{id}(W)=(\mathbb{I}(W,T),\mathbb{I}(\overline{W},\overline{T}))$ . We additionally assume the knowledge of $|W|$ , i.e. $\mathtt{id}(W)=(\mathbb{I}(W,T),\mathbb{I}(\overline{W},\overline{T}),|W|)$ . We only show how to support $\mathtt{contractLeft}(\mathtt{id}(aW))=\mathtt{id}(W)$ , since supporting $\mathtt{contractRight}(\mathtt{id}(Wa))=\mathtt{id}(W)$ is symmetric. Since [11] already supports $\mathtt{contractLeft}(\mathtt{id}(aW))$ for right-maximal substrings, we assume for now that $aW$ is not right-maximal. Note that we can decide whether $aW$ is right-maximal or not by using $\mathbb{I}(\overline{aW},\overline{T})$ , and, if $W$ is right-maximal, we can just use the contraction algorithm described above. Let $v$ be the locus of $aW$ in $\mathsf{ST}$ : this can be computed from $\mathbb{I}(aW,T)$ using $\mathtt{lca}$ queries on $\mathsf{ST}$ . Since $aW$ is not right maximal, $aW\neq\ell(v)$ and $aW$ ends in the middle of edge $(u,v)$ of $\mathsf{ST}$ . We take in constant time the suffix link $(u,u^{\prime})$ from $u$ and the suffix link $(v,v^{\prime})$ from $v$ , and we decide whether $(u^{\prime},v^{\prime})$ is an edge or a path of $\mathsf{ST}$ by comparing $u^{\prime}$ to $\mathtt{parent}(v^{\prime})$ , which can be computed in constant time. If $(u^{\prime},v^{\prime})$ is an edge of $\mathsf{ST}$ (Figure 2c), then $v^{\prime}$ is the locus of $W$ and we compute $\mathbb{I}(\ell(v^{\prime}),T)$ in constant time. 
Otherwise (Figure 2d), we compute in constant time $z=\mathtt{parent}(v^{\prime})$ : this node is a maximal repeat by Property 1, since it is an internal node of $\mathsf{ST}$ with an implicit Weiner link whose destination falls inside $(u,v)$ . We use the data structures in Section 2.3 to measure the length of $\ell(z)$ in constant time. If $|W|>|\ell(z)|$ , the locus of $W$ is again $v^{\prime}$ . Otherwise, since $z$ is a maximal repeat, we move in constant time to the node $z^{\prime}$ of $\overline{\mathsf{SLT}}$ that corresponds to $\ell(z)$ , we issue a constant-time level ancestor query from $z^{\prime}$ on $\overline{\mathsf{SLT}}$ with length $|W|$ , and, from the destination $x^{\prime}$ of such a level ancestor query, we move in constant time to the first branching descendant $y^{\prime}$ of $x^{\prime}$ , by using $\mathtt{leftmostLeaf}$ , $\mathtt{rightmostLeaf}$ , and $\mathtt{lca}$ queries on $\overline{\mathsf{SLT}}$ . Finally, we move in constant time to the node $y$ of $\mathsf{ST}$ that corresponds to $y^{\prime}$ , and we compute $\mathbb{I}(\ell(y),T)$ in constant time. We compute $\mathbb{I}(\overline{W},\overline{\mathsf{ST}})$ as described at the beginning of Section 3. ∎
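To make the input and output of $\mathtt{contractLeft}$ concrete, the following brute-force sketch computes the descriptor $(\mathbb{I}(W,T),\mathbb{I}(\overline{W},\overline{T}),|W|)$ directly from sorted suffixes. It is not the constant-time algorithm of the proof, only a reference against which an implementation could be checked. The function names are hypothetical, and intervals are assumed to be 0-based and inclusive over the lexicographically sorted suffixes of a text that ends with a terminator.

```python
from typing import Optional, Tuple

Interval = Optional[Tuple[int, int]]


def interval(text: str, pattern: str) -> Interval:
    """Inclusive range of sorted suffixes of text that start with pattern."""
    suffixes = sorted(text[i:] for i in range(len(text)))
    hits = [r for r, s in enumerate(suffixes) if s.startswith(pattern)]
    # Suffixes sharing a prefix are contiguous in sorted order.
    return (hits[0], hits[-1]) if hits else None


def descriptor(text: str, w: str) -> Tuple[Interval, Interval, int]:
    """Interval of W in T, interval of reversed W in reversed T, and |W|."""
    return interval(text, w), interval(text[::-1], w[::-1]), len(w)


def contract_left_oracle(text: str, aw: str) -> Tuple[Interval, Interval, int]:
    """Descriptor of W, obtained by dropping the first character of aW."""
    return descriptor(text, aw[1:])
```

On `"banana$"`, the string `"bana"` is not right-maximal (it is always followed by `n`), and `contract_left_oracle("banana$", "bana")` returns `((2, 3), (2, 3), 3)`, the descriptor of `"ana"`.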

Note that the algorithm in Theorem 1 works even when $aW$ is right-maximal; moreover, if the information on whether $aW$ is right maximal or not is given in input, the algorithm can decide whether $W$ is right maximal or not. In a practical implementation, once we have taken the suffix link $(v,v^{\prime})$ from $v$ , we could check whether $v^{\prime}$ is a maximal repeat, and in the positive case we could immediately commute to $\overline{\mathsf{SLT}}$ and issue level ancestor queries. If $v^{\prime}$ is not a maximal repeat, we could move in constant time to the lowest ancestor $v^{\prime\prime}$ of $v^{\prime}$ that is a maximal repeat, using a lowest marked ancestor data structure on $\mathsf{ST}$ , we could measure $|\ell(v^{\prime\prime})|$ , and if $|\ell(v^{\prime\prime})|\geq|W|$ , we could again issue level ancestor queries in $\overline{\mathsf{SLT}}$ (otherwise, the locus of $W$ is again $v^{\prime}$ ).

A bidirectional index on $T$ that supports extension and contraction in constant time can be used to implement in linear time several applications that slide a window $S[i..j]$ of fixed length over a query string $S$, and that compute the frequency of every $S[i..j]$ in $T$, without the size of the window being known during construction. (If the size $k$ of the window is fixed and known during construction, most such applications do not need the contract operation, and can be made to work using just one BWT and a bitvector of length $|T|$ that marks the boundaries of $k$-mer intervals in the BWT.) Examples include measuring the frequency of windows of fixed length for read correction [43], computing the inner product between the $k$-mer composition vectors of $S$ and $T$ (a step in $k$-mer kernels), estimating the probability of $S$ according to a fixed-order Markov model trained on $T$, and checking whether $S$ is a path in the de Bruijn graph of $T$. Our index also enables applications in which the sliding window needs to be extended or contracted during the scan, like variable-order and interpolated Markov models (see [21] for an overview). A fully-functional bidirectional index is not needed for computing the matching statistics array between $S$ and $T$, in linear time and in $O(|T|\log\sigma)$ bits of space, since one can use the algorithms in [5] on top of the data structures in [4]. However, achieving such bounds with our bidirectional index becomes trivial.
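The sliding-window pattern can be sketched with a brute-force stand-in for the index. The class `NaiveIndex` and the function `window_frequencies` are hypothetical names; every operation below rescans the text, whereas the index of Theorem 1 would answer each extension and contraction in constant time. The stand-in also tolerates windows that do not occur in $T$, which is an assumption made to keep the example short.

```python
from typing import List


class NaiveIndex:
    """Brute-force stand-in: keeps the current window W explicitly."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.window = ""

    def extend_right(self, c: str) -> None:
        self.window += c

    def contract_left(self) -> None:
        self.window = self.window[1:]

    def frequency(self) -> int:
        """Number of (possibly overlapping) occurrences of the window in the text."""
        w, t = self.window, self.text
        return sum(1 for i in range(len(t) - len(w) + 1) if t.startswith(w, i))


def window_frequencies(text: str, query: str, k: int) -> List[int]:
    """Frequency in text of every length-k window of query, left to right."""
    index = NaiveIndex(text)
    frequencies: List[int] = []
    for c in query:
        # Contract first, so the window never grows beyond k characters.
        if len(index.window) == k:
            index.contract_left()
        index.extend_right(c)
        if len(index.window) == k:
            frequencies.append(index.frequency())
    return frequencies
```

Note that `k` is a parameter of the scan and not of the constructor, mirroring the fact that the window size need not be known when the index is built. For example, `window_frequencies("banana", "ananas", 3)` returns `[2, 1, 2, 0]`.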

In practical applications of matching statistics, one typically needs to maintain the intervals in both $\mathsf{BWT}$ and $\overline{\mathsf{BWT}}$ just after every successful right extension, and, when the current match $S[i..j]$ cannot be extended with $S[j+1]$ in $T$ any longer, one might need both BWT intervals just for the proper suffixes $S[k..j]$ such that $\Sigma^{r}_{T}(S[i..j])\subset\Sigma^{r}_{T}(S[k..j])$ , i.e. just for the suffixes of $S[i..j]$ from which a right-extension with $S[j+1]$ is attempted again. Every such suffix is a maximal repeat ancestor of $\overline{S[i..j]}$ in $\overline{\mathsf{ST}}$ [9], thus, once we reach the locus of such a suffix in $\overline{\mathsf{ST}}$ with $\mathtt{parent}$ operations, we can compute its interval in $\overline{\mathsf{BWT}}$ , we can measure its string length $p$ , and we can compute its interval in $\mathsf{BWT}$ by issuing $\mathsf{MS}[i]-p$ contract operations from the locus of $S[i..j]$ in $\mathsf{ST}$ , but without updating the interval in $\overline{\mathsf{BWT}}$ after each contraction. Even more aggressively, we can just issue $\mathsf{MS}[i]-p$ suffix links from the locus of $S[i..j]$ in $\mathsf{ST}$ . Note that such a locus might correspond to the right-maximal string $S[i..j]\cdot V$ for some nonempty $V$ , thus taking $\mathsf{MS}[i]-p$ suffix links might lead to a node of $\mathsf{ST}$ that corresponds to the right-maximal string $S[k..j]\cdot V$ : thus, we need to move in constant time from such a node, to its lowest ancestor in $\mathsf{ST}$ that is a maximal repeat; from there, we can then issue a level ancestor query with value $p$ . Such a lazy synchronization might be faster than issuing $\mathsf{MS}[i]-p$ full contract operations in practice.
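As a point of reference for the discussion above, the following brute-force sketch computes matching statistics with the same extend-then-contract control flow, assuming the usual definition in which $\mathsf{MS}[i]$ is the length of the longest prefix of $S[i..]$ that occurs in $T$ (0-based here). Substring tests with Python's `in` operator stand in for right extensions in the index; the lazy synchronization of the two BWT intervals is not modeled.

```python
from typing import List


def matching_statistics(text: str, query: str) -> List[int]:
    """MS[i] = length of the longest prefix of query[i:] occurring in text."""
    ms: List[int] = []
    j = 0  # the current match is query[i:j]
    for i in range(len(query)):
        if j < i:
            j = i  # the previous match was empty: restart from position i
        # Right extensions: grow the match while it still occurs in text.
        while j < len(query) and query[i:j + 1] in text:
            j += 1
        ms.append(j - i)
        # Moving to i + 1 corresponds to one left contraction: query[i+1:j]
        # is a suffix of the current match, so it still occurs in text.
    return ms
```

For example, `matching_statistics("banana", "nanab")` returns `[4, 3, 2, 1, 1]`. Since `j` never decreases, the loop performs at most $|S|$ successful extensions and $|S|$ contractions overall.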

Our index can be seen as a representation of a de Bruijn graph that supports bidirectional navigation, that allows access to the frequency of every $k$ -mer and $(k+1)$ -mer, and that has no upper bound on the order: we call infinite-order such a de Bruijn graph. Note that, for a given order $k$ , we can support both the variant in which arcs must occur in $T$ (calling $\mathtt{extendRight}$ and then $\mathtt{contractLeft}$ to implement $\mathtt{arc}$ and $\mathtt{followArc}$ ), and the variant in which arcs do not have to occur in $T$ (calling $\mathtt{contractLeft}$ and then $\mathtt{extendRight}$ ). Membership queries reduce to backward searches, and we can move from a higher to a lower order using the same algorithm as in matching statistics. Indeed, one typically wants to switch to a suffix of the current $k$ -mer whenever there is only one arc in the graph of the current order, and this arc is labelled with the terminator character [22]; or, more generally, whenever one needs to increase the number of outgoing arcs from the current $k$ -mer (for example because the existing ones have already been explored [37]), or to increase the frequency of the current right-maximal $k$ -mer. In all such cases, one wants to switch to the largest order with the desired property, and the corresponding suffix is always a maximal repeat (for example, the longest suffix, of the current right-maximal $k$ -mer, that has strictly greater frequency, is a maximal repeat). Symmetrically, when increasing the order, one may want to switch e.g. from the current $k$ -mer $W$ that is left-maximal but not right-maximal, to the maximal repeat $WV$ with shortest $V$ . Clearly $\mathbb{I}(WV,T)=\mathbb{I}(W,T)$ , we know $|V|$ since we can access $|WV|$ , and we can compute $\mathbb{I}(\overline{WV},\overline{T})$ by taking $|V|$ Weiner links from $\mathbb{I}(\overline{W},\overline{T})$ . 
All such Weiner links are explicit, and in practice we can just update the first position of the interval at every step.
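The difference between the two arc variants can be shown with a small brute-force sketch. The function `successors` and its parameter `arcs_must_occur` are hypothetical names, and substring tests replace the index operations; the comments indicate which index operation each step would correspond to.

```python
from typing import List


def successors(text: str, kmer: str, arcs_must_occur: bool) -> List[str]:
    """Out-neighbours of a k-mer in the de Bruijn graph of text (brute force)."""
    result: List[str] = []
    for c in sorted(set(text)):
        if arcs_must_occur:
            # extendRight, then contractLeft: the (k+1)-mer must occur in text.
            extended = kmer + c
            if extended in text:
                result.append(extended[1:])
        else:
            # contractLeft, then extendRight: only the target k-mer must occur.
            target = kmer[1:] + c
            if target in text:
                result.append(target)
    return result
```

On the text `"abcxbd"` with the 2-mer `"ab"`, the first variant returns `["bc"]` because only `"abc"` occurs, while the second returns `["bc", "bd"]` because both 2-mers occur even though `"abd"` does not.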

In the next section, we describe a representation of an infinite-order de Bruijn graph in which the time to decrease or increase the order does not depend on the difference between the source and the destination order.

4. Implementing de Bruijn graphs with CDAWGs

An affix link $\mathbb{A}(w)$ is a map from a node $w$ of $\mathsf{ST}$ , to the locus of $\overline{\ell(w)}$ in $\overline{\mathsf{ST}}$ (we use $\overline{\mathbb{A}}(w)$ to denote the symmetrical map from a node $w$ of $\overline{\mathsf{ST}}$ , to the locus of $\overline{\ell(w)}$ in $\mathsf{ST}$ ) [49, 50]. We use $\mathbb{A}(W)$ as a shorthand for $\mathbb{A}(w)$ where $w$ is the locus of $W$ . In asynchronous bidirectional indexes, affix links are used to switch direction when the user desires [50]. In this section we are more interested in their ability to extend a non-maximal repeat in a bidirectional index: for example, if $W$ is right-maximal but not left-maximal, and if it has loci $(v,w)$ in $\mathsf{ST}$ and $\overline{\mathsf{ST}}$ , respectively, then its shortest left-maximal extension $VW$ with $|V|\geq 0$ , i.e. the shortest maximal repeat that contains $W$ as a (not necessarily proper) suffix, has loci $(\overline{\mathbb{A}}(w),w)$ ; and if $W$ is neither left- nor right-maximal, then the shortest maximal repeat $UWV$ with the same frequency as $W$ has loci $(\overline{\mathbb{A}}(\mathbb{A}(v)),\mathbb{A}(v))=(\overline{\mathbb{A}}(w),\mathbb{A}(\overline{\mathbb{A}}(w)))$ [50]. Thus, in what follows we ignore affix links from leaves.

Rather than storing $\mathbb{A}(w)$ for every internal node $w$ of $\mathsf{ST}$ , it has been proposed to sample $\mathbb{A}(w)$ every $p$ suffix links [18]: indeed, $\mathbb{A}(w)$ is either $v=\mathbb{A}(\mathtt{suffixLink}(w))$ , if $|\ell(v)|\geq|\ell(w)|$ , or it is the child of $v$ obtained by following the first character of $\ell(w)$ [50]. This allows one to compute $\mathbb{A}(w)$ in $O(p)$ time, paying $O((|T|/p)\log{n})$ bits of space. We briefly observe that, compared to existing sampling schemes for bidirectional indexes, we can further reduce space to $O((|T|/p)\log{m})$ bits, where $m$ is the number of maximal repeats of $T$ , since, by Property 2, $\mathbb{A}(v)$ is a maximal repeat of $T$ for every internal node $v$ of $\mathsf{ST}_{T}$ . In practice following Weiner links is faster than following suffix links: thus, one could sample the value of $\mathbb{A}(w)$ for every maximal repeat, and then sample every $p$ characters inside an edge of $\overline{\mathsf{ST}}$ that connects two maximal repeats, i.e. every $p$ explicit Weiner links. If $\mathbb{A}(w)$ is not sampled, then $\ell(w)$ is not left-maximal, so we take the only possible Weiner link from it and we repeat the search from there, returning the value of the first sampled node we find. This sampling scheme takes $O((m+(|T|-m)/p)\log{m})$ bits of space. One could even waive sampling the nodes of $\mathsf{ST}$ that are not maximal repeats, but to retrieve their value one would have to pay a number of Weiner links that is at most equal to the length of the longest edge of $\overline{\mathsf{ST}}$ connecting two maximal repeats. Clearly, sampling just maximal repeats works also for the scheme based on suffix links.

In this section we store $\mathbb{A}(w)$ and $\overline{\mathbb{A}}(w)$ explicitly, but just for maximal repeats, together with $\mathsf{CDAWG}_{T}$ and $\overline{\mathsf{CDAWG}}_{T}$ , to implement an infinite-order de Bruijn graph in which the time to increase or decrease the order does not depend on the difference between the source and the destination order:

Theorem 2.

Given a string $T$ , there are a fully-functional bidirectional index, and an infinite-order representation of the de Bruijn graph of $T$ , that take space proportional to the number of left and right extensions of the maximal repeats of $T$ , and that support all queries in $O(\log{\log{|T|}})$ time.

Proof.

We represent $\mathsf{ST}$ and $\overline{\mathsf{ST}}$ using CDAWGs, as described in [8] and summarized in Section 2.3 of this paper. In addition to $\mathsf{RLBWT}$ , $\overline{\mathsf{RLBWT}}$ , $\mathsf{CDAWG}$ and $\overline{\mathsf{CDAWG}}$ , to support Theorem 1 we store also a weighted level ancestor data structure on the maximal repeat subgraph of $\mathsf{ST}$ and $\overline{\mathsf{ST}}$ , which takes $O(m)$ space and answers queries in $O(\log{\log{|T|}})$ time [1, 24], and we store $\mathbb{A}$ and $\overline{\mathbb{A}}$ to support changes in the order of the de Bruijn graph ( $m$ is the number of maximal repeats of $T$ ). We represent an arbitrary substring $W$ of $T$ as a triple $(\mathtt{id}(v),\mathtt{id}(w),|W|)$ , where $v$ is the locus of $W$ in $\mathsf{ST}$ , $w$ is the locus of $\overline{W}$ in $\overline{\mathsf{ST}}$ , and $\mathtt{id}$ is the identifier of a node in the CDAWG-based representation of a suffix tree, i.e. $\mathtt{id}(v)=(v^{\prime},|\ell(v)|,i,j)$ where $v^{\prime}$ is a node of a CDAWG and $[i..j]$ is a BWT interval.

To implement $\mathtt{extendRight}(W,c)$ , where $Wc$ is assumed to occur in $T$ , we first check whether $W$ is right-maximal, by comparing $|W|$ to $|\ell(v)|$ : if $W$ is not right-maximal, then the representation of $Wc$ is $(\mathtt{id}(v),\mathtt{weinerLink}(\mathtt{id}(w),c),|W|+1)$ . Otherwise, the representation is $(\mathtt{child}(\mathtt{id}(v),c),\mathtt{weinerLink}(\mathtt{id}(w),c),|W|+1)$ . If we assume that procedure $\mathtt{extendRight}(W,c)$ can be called with an invalid $c$ , we first have to check whether $Wc$ occurs in $T$ using the interval of $\overline{W}$ in $\overline{\mathsf{BWT}}$ . To implement $\mathtt{contractLeft}(aW)$ , we first check whether $aW$ is right-maximal, by comparing $|aW|$ to $|\ell(v)|$ : if so, the representation of $W$ is $(\mathtt{suffixLink}(\mathtt{id}(v)),\mathtt{id}(w^{\prime}),|W|)$ , where $w^{\prime}$ is either the parent of $w$ or $w$ itself, depending on which one of them has the same frequency as the locus of $W$ in $\mathsf{ST}$ . If $aW$ is not right-maximal, we run the algorithm in Theorem 1 using the $\mathtt{suffixLink}$ and $\mathtt{parent}$ operations provided by the CDAWG-based representation of $\mathsf{ST}$ , and issuing weighted level ancestor queries on the maximal repeat subgraph of $\mathsf{ST}$ rather than level ancestor queries on the topology of $\overline{\mathsf{SLT}}$ .

To implement $\mathtt{decreaseK}$ and $\mathtt{increaseK}$ in the de Bruijn graph, we proceed as follows. If the current $k$ -mer $W$ is right-maximal, the representation of the longest suffix of $W$ that is a maximal repeat is clearly $(\mathtt{id}(z),\mathtt{id}(\mathbb{A}(z)),|\ell(z)|)$ , where $z$ is the maximal repeat reached by taking a suffix link arc from the node of the CDAWG pointed by $\mathtt{id}(v)$ . One could further move to a suitable ancestor of such a maximal repeat, by marking the topology of the maximal repeat subgraph of $\mathsf{ST}$ . If the current $W$ is left-maximal but not right-maximal, the representation of the shortest maximal repeat of the form $WV$ for some nonempty $V$ is $(\mathtt{id}(z),\mathtt{id}(\mathbb{A}(z)),|\ell(z)|)$ , where $z$ is the node of the CDAWG pointed by $\mathtt{id}(v)$ . The same holds if $W$ is neither left- nor right-maximal, and if we want to move to the shortest $k$ -mer that contains $W$ and is both left- and right-maximal. Implementing the other operations of a bidirectional de Bruijn graph is straightforward and is left to the reader. We use data structures from [6] to answer the membership query $\mathtt{node}(W)$ in $O(|W|)$ time. ∎

Our construction based on two CDAWGs is reminiscent of the symmetric compact DAWG described in [16], which was used however just for bidirectional extension. Theorem 2 could be simplified in several ways for a practical implementation. For example, as noted already in [16], since $\mathsf{CDAWG}$ and $\overline{\mathsf{CDAWG}}$ share the same set of nodes, every such node could be stored only once, in which case $\mathbb{A}$ and $\overline{\mathbb{A}}$ would not need to be represented explicitly. If the descriptor of a substring $W$ is $(\mathtt{id}(v),\mathtt{id}(w),|W|)$ with $\mathtt{id}(v)=(v^{\prime},|\ell(v)|,i,j)$ and $\mathtt{id}(w)=(w^{\prime},|\ell(w)|,i^{\prime},j^{\prime})$ , then $v^{\prime}$ and $w^{\prime}$ would become pointers to the same node, $|\ell(w)|$ could be derived from $|\ell(v^{\prime})|-|\ell(v)|+|W|$ , and rather than storing $i,j$ and $i^{\prime},j^{\prime}$ , we could just store $i,i^{\prime},f(W)$ . Our representation collapses to the sink of a CDAWG all $k$ -mers that occur just once in the dataset, which are likely induced by sequencing errors and are thus not useful for most applications: in this case, we don’t even need to store left and right extensions of maximal repeats directed to the sink. If the target application never uses orders smaller than a threshold $\tau$ , we could remove from the index all maximal repeats of length smaller than $\tau$ and prune the top part of the corresponding tree data structures, as described in [22]. We could proceed in a similar way when the user specifies a lower bound on the frequency of $k$ -mers (called solid, see e.g. [29, 37]).

5. Discussion and extensions

Our CDAWG-based representation of the de Bruijn graph might be practical: a full experimental study and a careful implementation of each primitive would be an interesting research direction. Given a node $v$ in the de Bruijn graph, it would also be interesting to know if we can traverse an entire maximal non-branching path, i.e. a path in which no $k$ -mer except for $v$ and the destination has more than one arc to the left and to the right, without taking time proportional to the length of such a path: this would provide a fast implementation of the compacted de Bruijn graph (see e.g. [19, 36] and references therein). It is natural to wonder whether one can support the operations of an infinite-order de Bruijn graph in less space than our indexes. Another open question is whether the CDAWG can be used as a substrate for implementing the string graph as well, and whether we can design a single compact index, as wished by [23], that supports both the primitives of a string graph and of an infinite-order de Bruijn graph efficiently, allowing the user to take advantage of both approaches in genome assembly.

6. Acknowledgements

We thank Martin Bundgaard for motivating the contract operation, Rodrigo Canovas for discussions about bidirectional indexes, Gene Myers for discussions about PacBio CCS reads, and German Tischler for help with $k$ -mer counting.

Bibliography


[1] Amihood Amir, Gad M Landau, Moshe Lewenstein, and Dina Sokol. Dynamic text and static pattern matching. ACM Transactions on Algorithms (TALG), 3(2):19, 2007.
[2] Alberto Apostolico and Gill Bejerano. Optimal amnesic probabilistic automata or how to learn and classify proteins in linear time and space. Journal of Computational Biology, 7(3-4):381–393, 2000.
[3] Uwe Baier, Timo Beller, and Enno Ohlebusch. Graphical pan-genome analysis with compressed suffix trees and the Burrows–Wheeler transform. Bioinformatics, 32(4):497–504, 2015.
[4] Djamal Belazzougui. Linear time construction of compressed text indices in compact space. In Proceedings of the forty-sixth Annual ACM Symposium on Theory of Computing, pages 148–193. ACM, 2014.
[5] Djamal Belazzougui and Fabio Cunial. Indexed matching statistics and shortest unique substrings. In International Symposium on String Processing and Information Retrieval, pages 179–190. Springer, 2014.
[6] Djamal Belazzougui and Fabio Cunial. Fast label extraction in the CDAWG. In International Symposium on String Processing and Information Retrieval, pages 161–175. Springer, 2017.
[7] Djamal Belazzougui and Fabio Cunial. A framework for space-efficient string kernels. Algorithmica, 79(3):857–883, 2017.
[8] Djamal Belazzougui and Fabio Cunial. Representing the suffix tree with the CDAWG. In LIPIcs-Leibniz International Proceedings in Informatics, volume 78. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.
