Computing the BWT and LCP array of a Set of Strings in External Memory

Paola Bonizzoni; Gianluca Della Vedova; Yuri Pirola; Marco Previtali; Raffaella Rizzi

arXiv:1705.07756·cs.DS·December 7, 2020

TL;DR

This paper introduces an external memory algorithm for efficiently computing the BWT and LCP array of large string collections, crucial for genome assembly, using a novel backward approach that reduces memory and I/O requirements.

Contribution

It presents a new external memory algorithm employing a backward strategy to compute BWT and LCP arrays simultaneously for large string sets, improving efficiency over previous in-memory methods.

Findings

1. The algorithm runs in $\mathcal{O}(mkl)$ time and I/O volume for constant-length strings.

2. It uses only $\mathcal{O}(k+m)$ main memory, which makes it suitable for large datasets.

3. It is effective for genome assembly and large-scale string indexing.

Abstract

Indexing very large collections of strings, such as those produced by the widespread next generation sequencing technologies, heavily relies on multistring generalization of the Burrows-Wheeler Transform (BWT): large requirements of in-memory approaches have stimulated recent developments on external memory algorithms. The related problem of computing the Longest Common Prefix (LCP) array of a set of strings is instrumental to compute the suffix-prefix overlaps among strings, which is an essential step for many genome assembly algorithms. In a previous paper, we presented an in-memory divide-and-conquer method for building the BWT and LCP where we merge partial BWTs with a forward approach to sort suffixes. In this paper, we propose an alternative backward strategy to develop an external memory method to simultaneously build the BWT and the LCP array on a collection of m strings of…

Tables

Table 1: Time required to compute the BWT and the LCP array (in minutes) on the NA24385 and random datasets using a PC with 1GB RAM. The first column indicates the number of sequences in the dataset. Column $l$ indicates the maximum value of the LCP array on that dataset. $⋆$ means that the tool required more than 1GB of RAM. $⋄$ means that the tool crashed because of disk space exhaustion.

Dataset: NA24385
No. of strings	$l$	ble	BEETL2	BEETL	gsa-is	egsa	egap
1M	148	26	12	11	28	60	1
2M	148	53	25	23	$⋆$	239	18
4M	148	105	51	52	$⋆$	693	44
8M	148	213	122	124	$⋆$	2370	184
16M	148	414	241	227	$⋆$	$⋄$	448
32M	148	855	633	$⋆$	$⋆$	$⋄$	915

Table 2: Peak RAM usage on the NA24385 and random datasets using a PC with 1GB RAM. The first column indicates the number of sequences in the dataset. Column $l$ indicates the maximum value of the LCP array on that dataset. $⋆$ means that the tool required more than 1GB of RAM. $⋄$ means that the tool crashed because of disk space exhaustion.

Dataset: NA24385
No. of strings	$l$	ble	BEETL2	BEETL	gsa-is	egsa	egap
1M	148	6	35	255	747	764	728
2M	148	10	67	453	$⋆$	770	760
4M	148	18	131	722	$⋆$	775	774
8M	148	34	255	710	$⋆$	773	772
16M	148	65	483	732	$⋆$	$⋄$	768
32M	148	127	781	$⋆$	$⋆$	$⋄$	763

Code & Models

Repositories

AlgoLab/bwt-lcp-em
Official

Full text

DISCo, Università degli Studi di Milano–Bicocca, Milan, Italy

∗Corresponding author [email protected]

Computing the multi-string BWT and LCP array in external memory

Paola Bonizzoni∗

Gianluca Della Vedova

Yuri Pirola

Marco Previtali

Raffaella Rizzi

Abstract

Indexing very large collections of strings, such as those produced by the widespread next generation sequencing technologies, heavily relies on multi-string generalization of the Burrows-Wheeler Transform (BWT): large requirements of in-memory approaches have stimulated recent developments on external memory algorithms. The related problem of computing the Longest Common Prefix (LCP) array of a set of strings is instrumental to compute the suffix-prefix overlaps among strings, which is an essential step for many genome assembly algorithms. In a previous paper, we presented an in-memory divide-and-conquer method for building the BWT and LCP where we merge partial BWTs with a forward approach to sort suffixes.

In this paper, we propose an alternative backward strategy to develop an external memory method to simultaneously build the BWT and the LCP array on a collection of $m$ strings of different lengths. The algorithm over a set of strings having constant length $k$ has $\mathcal{O}(mkl)$ time and I/O volume, using $\mathcal{O}(k+m)$ main memory, where $l$ is the maximum value in the LCP array.

1 Introduction

In this paper we address the problem of constructing, simultaneously and in external memory, the Burrows-Wheeler Transform (BWT) and the Longest Common Prefix (LCP) array for a large collection of strings. The widespread use of Next-Generation Sequencing (NGS) technologies, which produce several terabytes of data every day that have to be analyzed, requires efficient strategies to index very large collections of strings. For example, common applications in metagenomics require indexing of collections of strings (reads) that are sampled from several genomes: those strings can easily contain more than $10^{8}$ characters. In fact, to start a catalogue of the human gut microbiome, more than 500GB of data have been used [1].

The Burrows-Wheeler Transform (BWT) [2] is a reversible transformation of a text that was originally designed for text compression; it is used for example in the bzip2 program. The BWT of a text $T$ is a permutation of its symbols and is strictly related to the Suffix Array of $T$. In fact, the $i$-th symbol of the BWT is the symbol preceding the $i$-th smallest suffix of $T$ according to the lexicographical sorting of the suffixes of $T$. The BWT has gained importance beyond its initial purpose, and has become the basis for self-indexing structures such as the FM-index [3], which allows one to efficiently perform important tasks such as searching a pattern in a text [3, 4, 5]. The generalization of the BWT (and the FM-index) to a collection of strings was introduced in [6, 7].

An entire generation of recent Bioinformatics tools heavily rely on the notion of BWT. Representing the reference genome with its FM-index is the basis of the most widely used aligners, such as Bowtie [8], BWA [9, 10] and SOAP2 [11]. Still, to attack some other fundamental Bioinformatics problems, such as genome assembly, an all-against-all comparison among the input strings is needed, especially to find all prefix-suffix matches (or overlaps) between reads in the context of the Overlap-Layout-Consensus (OLC) approach based on string graph [12]. This fact justifies the search for extremely efficient algorithms to compute the BWT on a collection of strings [13, 14, 15, 16]. For example, SGA (String Graph Assembler) [17] is a de novo genome assembler that builds a string graph from the FM-index of the collection of input reads. In a preliminary version of SGA [18], the authors estimated, for human sequencing data at a 20x coverage, the need of 700Gbytes of RAM in order to build the suffix array, using the construction algorithm in [19], and the FM-index. Another technical device that is used to tackle the genome assembly in the OLC approach is the Longest Common Prefix (LCP) array of a collection of strings, which is instrumental to compute the prefix-suffix matches in the collection.

The construction of the BWT and LCP array of a huge collection of strings is a challenging task. A simple approach is constructing the BWT from the Suffix Array, but it is prohibitive for massive datasets mostly due to the main memory requirements. A first attempt to solve this problem [20] partitions the input collection into batches, computes the BWT for each batch and then merges the results.

The huge amount of available biological data has stimulated the development of efficient external memory algorithms (called BCR and BCRext) to construct the BWT of a collection of strings [21]. Similarly, a lightweight approach to the construction of the LCP array (called extLCP) was investigated in [22]. With the ultimate goal of obtaining an external memory genome assembler, LSG [23, 24] is based on BCRext and contains an external memory approach to compute the string graph of a set of reads. In that approach, external memory algorithms to compute the BWT and the LCP array [16, 25] are fundamental.

In this context, we are considering a model of computation where memory is split into two parts: a finite random access memory, and an unlimited sequential access disk. Essentially, this model is an extension of the standard RAM model where we also have a sequential access disk.

In this paper we present a new lightweight (external memory) approach to compute the BWT and the LCP array of a collection of strings of different lengths, which is an alternative to extLCP [22] and other approaches [26, 27, 28, 29, 30]. The literature is rich in in-memory methods [31, 32], as well as some external-memory algorithms on a single text [31]. From a theoretical point of view, we can transform a set of strings into an instance consisting of a single text by concatenating the input strings after adding a distinct sentinel for each string. However, this would increase the alphabet size from $\sigma$ to $m+\sigma$, and this effect must be taken into account.

The algorithm BCRext is proposed together with BCR and both are designed to work on huge collections of strings (the experimental analysis is on hundreds of millions of 100-long strings). Especially extLCP is lightweight because, on a collection of $m$ strings of length $k$ , it uses only $\mathcal{O}(m+\sigma^{2})$ RAM space and essentially $\mathcal{O}(mk^{2})$ CPU time, with matching I/O volume, under the usual assumption that the word size is sufficiently large to store all addresses.

An important question is how to achieve the optimal $\mathcal{O}(km)$ I/O volume. BCRext [21] incrementally computes the BWT of the collection $S$ via $k+1$ iterations. At each iteration $l$ ($0\leq l\leq k$) the algorithm computes a partial BWT $bwt_{l}(S)$ that is the BWT for the ordered collection of suffixes with length at most $l$. This approach requires that, at each iteration $l$, the symbols preceding the suffixes of $S$ with length $l-1$ are inserted at their correct positions into $bwt_{l-1}(S)$; that is, iteration $l$ simulates the insertion of the suffixes with length $l$ in the ordered collection of the suffixes with length at most $l-1$. Since the partial BWT $bwt_{l}(S)$ is updated in external memory, the process requires a sequential scan of the file containing the partial $bwt_{l-1}(S)$. Thus the I/O volume at each iteration $l$ is at least $m(l-1)\lg\sigma$ (since there are $m$ suffixes for each length between $1$ and $l-1$). Consequently the total I/O volume for computing $bwt_{k}(S)$ is at least $O(mk^{2})$. More precisely, the variant of the BCRext algorithm in [21] that uses less RAM requires at each iteration $l$ an additional I/O volume of $m\lg(km)$, due to the sorting of special arrays used to save RAM space. Our algorithm for building the BWT and the LCP, differently from [21], consists of two distinct phases. The first phase, which has $O(mk)$ I/O volume and time complexity, produces $k+1$ arrays $B_{0},\ldots,B_{k}$, where each array $B_{l}$ lists the symbols preceding the suffixes with length exactly $l$ according to the lexicographical ordering of such suffixes. The second phase computes the interleave of the arrays $B_{0},\ldots,B_{k}$ that is equal to the BWT $B$ of $S$. Indeed, the BWT $B$ is an interleave of the arrays $B_{0},\ldots,B_{k}$, since the ordering of symbols in $B_{l}$ is preserved in the final BWT $B$, i.e., $B$ is stable w.r.t. each array $B_{0},\ldots,B_{k}$. Inspired by [27], we perform this step in $L$ iterations, where $L$ is the length of the longest substring that has at least two occurrences in $S$. Thus the merging operation takes fewer iterations than BCRext (the latter requires $k$ iterations). Observe that at each iteration of the merging procedure of the arrays $B_{0},\ldots,B_{k}$, a partial LCP array is computed, so that the final LCP array is available at the last iteration.
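As a worked check of the quadratic bound quoted above for BCRext, one can sum the per-iteration lower bound $m(l-1)\lg\sigma$ over the iterations. The summation range below (from $l=1$ to $l=k$) is our reading of the text, not a formula taken from [21]:

$$\sum_{l=1}^{k} m(l-1)\lg\sigma \;=\; m\lg\sigma\sum_{l=1}^{k}(l-1) \;=\; m\lg\sigma\cdot\frac{k(k-1)}{2},$$

which grows as $mk^{2}$ for a constant alphabet, whereas the first phase described above only touches each of the $m(k+1)$ symbols a constant number of times.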

Our algorithm has $\mathcal{O}(mkl\sigma)$ time complexity, uses $\mathcal{O}(mkl\max\{\lg m,\lg l\})$ I/O volume, and $\mathcal{O}(\sigma w+kw+m\lg\sigma+\lg l)$ main memory, where $l$ is the maximum value in the LCP array and $w$ is the space required to store a memory address. Moreover, our approach is entirely based on linear scans (i.e., it does not contain a sorting step) which makes it more amenable to actual disk-based implementations. We point out that $l\leq k$ , therefore our time and I/O complexities are at least as good as those of extLCP [22] when building the data structures for massive sets of short sequences over a constant alphabet and if $\lg m$ and $\lg l$ are smaller than the word size (which is usually the case). The RAM usage of our approach and that of extLCP are not easily comparable, since they are respectively $\mathcal{O}(\sigma w+kw+m\lg\sigma+\lg l)$ and $\mathcal{O}(m+\sigma^{2})$ . If we suppose that we can store a memory address in a memory word, our RAM usage is $\mathcal{O}(\sigma+k+m\lg\sigma+\lg l)$ . This means that, in theory, when building the BWT and the LCP of few large strings extLCP will use less RAM than the method presented in this paper. We point out that our algorithm works also on a set of reads having different lengths, and the following sections describe the algorithm referring to that case.

While we were writing this paper, two similar approaches appeared in the literature [28, 30]. The method proposed in [28] starts from the BWT merging phase of [27] to also build the LCP array using a small amount of memory. We point out that our paper and [28] are two independent works. Moreover, our focus is on an external-memory approach, which is not explicitly pursued in [28]. An extension of [28] to fully external memory computation of the BWT and LCP is egap [33]. This method computes the data structures in three separate steps: (1) splitting the input sequences into subcollections such that the BWT can be computed in-memory, and (2-3) merging them while building the LCP along the way. The I/O and time complexities of egap are both $\mathcal{O}(mkl)$, matching the ones of our algorithm when the alphabet is constant.

The method proposed in [30] (egsa) is a two-phase algorithm for the construction of the Generalized Enhanced Suffix Array, the LCP, and, optionally, the BWT. In the first step the required data structures are built for each input sequence, whereas in the second step the output is produced by merging the data structures built previously. Although egap can be seen as an evolution of egsa, we included both tools in our experimental evaluation to highlight that building the Generalized Enhanced Suffix Array requires far more resources than directly building the BWT and the LCP.

The paper is laid out as follows. In Section 2 we provide the basic definitions we will use in the following. In Section 3 we give a high level description of the method proposed in this paper and illustrate the backward and forward strategies to merge partial arrays. In Section 4 and Section 5 we dive into the details of our algorithm. In Section 6 we analyze the time and I/O complexities of our method. In Section 7 we provide an experimental analysis of our tool and a comparison with other tools available in the literature. Finally, in Section 8 we recap the contributions of this paper.

2 Preliminaries

Let $\Sigma=\{c_{0},c_{1},\ldots,c_{\sigma}\}$ be a finite alphabet where $c_{0}=\$$ (called sentinel), and $c_{0}<c_{1}<\cdots<c_{\sigma}$ where $<$ specifies the lexicographic ordering over alphabet $\Sigma$. We consider a collection $S=\{s_{1},s_{2},\ldots,s_{m}\}$ of $m$ strings (reads), where each string $s_{j}$ consists of $k_{j}$ symbols over the alphabet $\Sigma\setminus\{\$\}$ and is terminated by the symbol $\$$, such that $k_{j}+1$ is the total length of $s_{j}$. The set $S$ is intended as a sequence of strings, where $s_{j}$ is the $j$-th string in the set. The $i$-th symbol of string $s_{j}$ is denoted by $s_{j}[i]$, and the substring $s_{j}[i]s_{j}[i+1]\cdots s_{j}[t]$ of $s_{j}$ is denoted by $s_{j}[i:t]$. We will refer to $s_{j}[k_{j}]$ as the last character of the string $s_{j}$; it is the character immediately before the sentinel. The suffix and prefix of $s_{j}$ with length $l$ are the substrings $s_{j}[k_{j}-l+1:k_{j}+1]$ (denoted by $s_{j}[k_{j}-l+1:]$) and $s_{j}[1:l]$ (denoted by $s_{j}[:l]$) respectively. Observe that $l$ counts, for the suffix, only the characters which are in $\Sigma\setminus\{\$\}$ (excluding $\$$). Then, the suffix and prefix with length $l$ of a string $s_{j}$ will be called the $l$-suffix and $l$-prefix of $s_{j}$, respectively. In particular, the $0$-suffix is the suffix uniquely composed of the sentinel $\$$. In the following we will denote with $K$ the total length of the input reads (including the sentinels $\$$).

Given the lexicographic ordering $X$ of the suffixes of $S$, the Suffix Array is the $K$-long array $SA$ where the element $SA[i]$ is equal to $(p,j)$ if and only if the $i$-th element of $X$ is the $p$-suffix of string $s_{j}$. We make the assumption that a suffix $s\$$ from string $s_{i}$ is lexicographically smaller than the identical suffix $s\$$ from a different string $s_{j}$ if $i<j$. In other words, the two identical suffixes are ordered according to the order of their origin strings in $S$. This assumption guarantees the uniqueness of the Suffix Array for the collection $S$.

The definition of suffix array we provide is slightly different than the one conventionally used, where $SA[i]=(p,j)$ refers to the suffix of the string $s_{j}$ starting at position $p$ . We have decided to abide by this definition to ease the presentation of the method in the following sections.

The Burrows-Wheeler Transform (BWT) of $S$ is the $K$-long array $B$ where if $SA[i]=(p,j)$, then $B[i]$ is the first symbol of the $(p+1)$-suffix of $s_{j}$ if $p<k_{j}$, otherwise $B[i]=\$$. In other words $B$ consists of the symbols preceding the ordered suffixes of $X$, where the preceding symbol is the sentinel $\$$ when the suffix is the complete string $s_{j}$ (i.e., the $k_{j}$-suffix).

The Longest Common Prefix (LCP) array of $S$ is the $K$ -long array $LCP$ such that $LCP[i]$ is the length of the longest prefix shared by suffixes $X[i-1]$ and $X[i]$ . Conventionally, $LCP[1]=-1$ .
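To make the three definitions above concrete, here is a minimal in-memory sketch that builds $SA$, $B$ and $LCP$ by explicitly sorting all suffixes. It is only an illustration of the definitions, not the external-memory method of this paper. The function names are hypothetical, and we assume (as an interpretation, not a statement from the text) that sentinels never contribute to an LCP value.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def naive_sa_bwt_lcp(reads):
    """Return (SA, B, LCP) for a list of reads given WITHOUT the sentinel.

    SA[i] = (p, j) means: the i-th smallest suffix is the p-suffix of s_j
    (strings are numbered from 1, as in the text; list positions are 0-based).
    """
    entries = []
    for j, s in enumerate(reads, start=1):
        for p in range(len(s) + 1):
            # p-suffix: the last p characters followed by the sentinel
            entries.append((s[len(s) - p:] + "$", j, p))
    # '$' is smaller than 'A'..'Z' in ASCII; ties are broken by string index j
    entries.sort(key=lambda e: (e[0], e[1]))

    sa = [(p, j) for _, j, p in entries]
    bwt = []
    for _, j, p in entries:
        s = reads[j - 1]
        # symbol preceding the p-suffix, or '$' for the complete string
        bwt.append(s[len(s) - p - 1] if p < len(s) else "$")
    lcp = [-1]
    for (a, _, _), (b, _, _) in zip(entries, entries[1:]):
        # assumption: compare without the trailing sentinel
        lcp.append(common_prefix_len(a[:-1], b[:-1]))
    return sa, bwt, lcp
```

For the two reads `AC` and `C`, the sorted suffixes are `$`, `$`, `AC$`, `C$`, `C$`, so the sketch returns the BWT `C C $ A $`.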

Now, we can give the definition of Interleave of a generic set of arrays, that will be used extensively in the following.

Definition 1.

Given $n+1$ arrays $V_{0},V_{1},\ldots,V_{n}$ , then an array $W$ is an interleave of $V_{0},V_{1},\ldots,V_{n}$ if $W$ is the result of merging the arrays such that: (i) there is a 1-to-1 function $\psi_{W}$ from the set $\cup_{i=0}^{n}\{(i,j):1\leq j\leq|V_{i}|\}$ to the set $\{q:1\leq q\leq|W|=\sum_{i}|V_{i}|\}$ , (ii) $V_{i}[j]=W[\psi_{W}(i,j)]$ for each $i,j$ , and (iii) $\psi_{W}(i,j_{1})<\psi_{W}(i,j_{2})$ for each $j_{1}<j_{2}$ .

The interleave $W$ is an array giving a fusion of $V_{0},V_{1},\ldots,V_{n}$ which preserves the relative order of the elements in each one of the arrays. As a consequence, for each $i$ with $0\leq i\leq n$, the $j$-th element of $V_{i}$ corresponds to the $j$-th occurrence in $W$ of an element of $V_{i}$. This fact allows us to encode the function $\psi_{W}$ as an array $I_{W}$ such that $I_{W}[q]=i$ if and only if $W[q]$ is an element of $V_{i}$. By observing that $W[q]$ is equal to $V_{I_{W}[q]}[r]$ where $r$ is the number of integers equal to $I_{W}[q]$ in $I_{W}[:q]$, it is easy to show how to reconstruct $W$ from $I_{W}$ (see Algorithm 1 where the array $pos$ keeps, for each index $i$ from $0$ to $n$, such number $r$ while scanning array $I_{W}$).

In the following, we will refer to array $I_{W}$ as interleave-encoding (or simply encoding). Figure 1 shows an example of an interleave of four arrays and its encoding.
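The reconstruction of $W$ from its encoding can be sketched in a few lines. The following is our own illustrative rendering of the idea described for Algorithm 1 (one counter per array, one linear scan of the encoding); the function name and the 0-based indexing are assumptions of this sketch.

```python
def reconstruct_interleave(arrays, encoding):
    """Rebuild the interleave W of `arrays` from its encoding I_W.

    encoding[q] = i means that W[q] comes from arrays[i]; pos[i] counts how
    many elements of arrays[i] have already been emitted.
    """
    pos = [0] * len(arrays)
    w = []
    for i in encoding:
        w.append(arrays[i][pos[i]])
        pos[i] += 1
    return w
```

Each input array is read strictly left to right, which is what makes the same scheme usable when the arrays are files on a sequential-access disk.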

3 The algorithm

In this section we will provide a sketch of our algorithm. Let $k$ be the maximum length of a string in $S$ (excluding $\$$) and let $X_{l}$ and $B_{l}$ ($0\leq l\leq k$) be arrays of length at most $m$ such that $X_{l}[i]$ is the $i$-th smallest $l$-suffix among all the $l$-suffixes of the strings of $S$ and $B_{l}[i]$ is the symbol preceding $X_{l}[i]$. In particular, $X_{0}$ and $B_{0}$ list respectively the $0$-suffixes and the last characters of the input strings in their order in the set $S$. Observe that $B_{l}$ is a subsequence of the BWT $B$ of $S$, and it is easy to see that $B$ is an interleave of the $k+1$ arrays $B_{0},B_{1},\ldots,B_{k}$, since the ordering of symbols in $B_{l}$ ($0\leq l\leq k$) is preserved in $B$.

Similarly, the lexicographic ordering $X$ of all suffixes of $S$ is an interleave of the arrays $X_{0},X_{1},\ldots,X_{k}$ . Let $I_{B}$ be the encoding of the interleave of arrays $B_{0},B_{1},\ldots,B_{k}$ giving the BWT $B$ , and let $I_{X}$ be the encoding of the interleave of arrays $X_{0},X_{1},\ldots,X_{k}$ giving $X$ . Then it is possible to show that $I_{B}=I_{X}$ . Now, given $I_{B}$ it is immediate to reconstruct $B$ by using Algorithm 1.

In the following, we will call $B_{0},B_{1},\ldots,B_{k}$ and $X_{0},X_{1},\ldots,X_{k}$ as partial BWTs and partial Suffix Arrays, respectively. Figure 2 shows an example of partial BWTs and partial Suffix Arrays for a set of $m=3$ reads on alphabet $\{A,C,G,T\}$ , whose interleaves $B$ and $X$ (BWT and sorted suffixes, respectively) and the encoding $I_{B}=I_{X}$ are reported in the first, second and third columns of Figure 3.
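The relation between partial arrays and the full BWT can be checked on a toy input. The sketch below builds $X_{l}$ and $B_{l}$ by brute-force sorting, derives the encoding $I_{X}$ from the fully sorted suffixes, and then interleaves the partial BWTs with it. All function names are hypothetical and everything is held in memory, so this only illustrates the definitions.

```python
def partial_arrays(reads):
    """Return (X, B): X[l] is the sorted list of l-suffixes, B[l] the symbols
    preceding them. Reads are given without the sentinel."""
    k = max(len(s) for s in reads)
    X, B = [], []
    for l in range(k + 1):
        rows = []
        for j, s in enumerate(reads):
            if len(s) >= l:
                suffix = s[len(s) - l:] + "$"
                prev = s[len(s) - l - 1] if l < len(s) else "$"
                rows.append((suffix, j, prev))
        rows.sort(key=lambda r: (r[0], r[1]))  # ties broken by string index
        X.append([r[0] for r in rows])
        B.append([r[2] for r in rows])
    return X, B


def suffix_interleave_encoding(reads):
    """Encoding I_X: the length of each suffix, listed in sorted order."""
    rows = [(s[len(s) - l:] + "$", j, l)
            for j, s in enumerate(reads) for l in range(len(s) + 1)]
    rows.sort(key=lambda r: (r[0], r[1]))
    return [l for _, _, l in rows]


def merge_by_encoding(arrays, encoding):
    """Interleave `arrays` according to `encoding` (one counter per array)."""
    pos = [0] * len(arrays)
    out = []
    for i in encoding:
        out.append(arrays[i][pos[i]])
        pos[i] += 1
    return out
```

For the reads `CA` and `AC` the partial BWTs are `B_0 = A C`, `B_1 = C A`, `B_2 = $ $`, the encoding is `0 0 1 2 1 2`, and the interleave is the BWT `A C C $ A $`.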

Our algorithm for building the BWT $B$ and the LCP array consists of two distinct phases: in the first phase it computes each partial BWT $B_{l}$ ( $0\leq l\leq k$ ) by implicitly sorting the $l$ -suffixes of $S$ (see Section 4), while in the second phase it determines $I_{X}=I_{B}$ (see Section 5) by a merging algorithm inspired by [27] (for merging two BWTs), thus allowing to reconstruct $B$ as an interleave of $B_{0},\ldots,B_{k}$ . We slightly modified the approach in [27] in order to merge the arrays $B_{0},\ldots,B_{k}$ into the BWT $B$ by implicitly merging the array $X_{0},\ldots,X_{k}$ into the array $X$ (giving the lexicographic ordering of all suffixes of $S$ ). The second phase computes, together with the BWT $B$ , also the LCP array of the input set $S$ .

We note that the definition of partial BWTs and the method sketched here hint at some relation between the partial BWTs and the positional BWT (pBWT) presented in [34], although the latter is presented for an alphabet of size $2$. Indeed, given a set of sequences, both reorder the characters at a given distance from one end of each input sequence. More precisely, each partial BWT is an ordering of all the elements of the sequences at a given distance from their end, whereas each column of the pBWT is an ordering of all the elements at a given distance from the start of the sequences. In light of this fact, we can describe the two steps sketched in this section as follows: (i) build the pBWT of the reversed input sequences, and (ii) build the BWT and the LCP array by merging the columns of the pBWT. Although we will not describe our method in terms of pBWT in the following sections, we think that the connection we just highlighted further confirms the strong relations between the multiple BWT-like data structures presented throughout the years to index different structures (e.g., trees [35], de Bruijn graphs [36, 37, 38], and circular patterns [39]), as recently shown in [40].

Both phases of our method apply a Radix Sort strategy to reorder the suffixes: the first phase sorts the $l$-suffixes of $S$ in order to compute the partial BWT $B_{l}$, and the second phase sorts the overall set of suffixes of $S$ in order to compute $I_{B}$. The first phase iteratively computes the partial BWTs $B_{0},B_{1},\ldots,B_{k}$. Each iteration $l$ ($0\leq l\leq k$) computes $B_{l}$ from the order of the $l$-suffixes (array $X_{l}$) implicitly computed by the previous iteration $l-1$ (array $X_{l-1}$). We point out that this algorithm adopts an LSD Radix Sort strategy that can be interpreted as “global”, since suffixes are sorted from the rightmost to the leftmost character, and the order of $X_{l}$ is implicitly obtained from the order of $X_{l-1}$ without applying the radix sort to each one of the sets of $l$-suffixes.

The second phase applies an MSD Radix Sort strategy, since it reorders the suffixes from the leftmost to the rightmost character, and it can be performed in two different ways, as described in the following section.

3.1 Backward and forward strategies for merging the partial BWTs

The encoding $I_{X}$ is basically computed by an iterative procedure starting from the trivial sorting given by taking first the suffixes of $X_{0}$, followed by the suffixes of $X_{1}$, followed by the suffixes of $X_{2}$, etc., followed by the suffixes of $X_{k}$ (trivial interleave). Note that the encoding of the trivial interleave is given by $k+1$ runs of the integers from $0$ to $k$: that is, $|X_{0}|$ integers equal to $0$, followed by $|X_{1}|$ integers equal to $1$, etc., followed by $|X_{k}|$ integers equal to $k$. Starting from that sorting, the procedure applies an MSD Radix Sort strategy to sort the suffixes of $S$, by the first (leftmost) character at the first iteration, then by the first two characters at the second iteration, etc., and finally by the first $k$ characters ($k$ is the maximum length of the strings in the input set $S$) at the $k$-th iteration. More precisely, at the $p$-th iteration, it computes the encoding of the interleave giving the sorting by the first $p$ characters from the interleave giving the sorting by the first $p-1$ characters (computed at the previous iteration). At the $k$-th iteration the computed encoding is clearly $I_{X}$.

In the following, the interleave of arrays $X_{0},X_{1},\ldots,X_{k}$ giving the sorting of the suffixes by the first $p$ characters will be called $p$-interleave and denoted as $X^{p}$, and its encoding will be denoted as $I_{X^{p}}$. The encoding $I_{X}$ is clearly equal to $I_{X^{k}}$ (and $X$ is equal to $X^{k}$). We point out that $X^{p}$ is the list of all the suffixes in the input collection $S$ sorted by their prefixes of length $p$. In other words, $X^{p}$ includes also suffixes shorter than $p$. In this ordering, a suffix $s\$$ shorter than $p$ will come before any suffix having string $s$ as a prefix, and moreover such suffix will have the same position in all the orderings $X^{q}$ such that $q>p$.

Iteration $p$ computes the encoding $I_{X^{p}}$ from the encoding $I_{X^{p-1}}$ obtained at the iteration $p-1$. The first iteration $p=1$ computes $I_{X^{1}}$ from $I_{X^{0}}$, which is the trivial encoding composed of $k+1$ runs of the integers from $0$ to $k$ ($I_{X^{0}}$ is the encoding of the $0$-interleave giving trivially the suffixes sorted by the first $0$ characters).

Two different strategies can be used for computing $I_{X^{p}}$ from $I_{X^{p-1}}$ , which are based on the two following observations.

Observation 1.

If $X^{p-1}[i]$ with length $l=I_{X^{p-1}}[i]$ is the $r$-th suffix preceded by a symbol $c$, then the suffix $cX^{p-1}[i]$ with length $l+1$ will be the $r$-th suffix in $X^{p}$ starting with $c$. Therefore, $I_{X^{p}}[j]$ will be equal to $l+1$, such that $j=s+r$, and $s$ is the number of symbols preceding suffixes of $X^{p-1}$ which are smaller than $c$. Observe that, when $c=\$$, then $cX^{p-1}[i]$ is actually the empty suffix having length $0$, and $s$ is equal to $0$.

Observation 2.

Let $[b,e]$ be the interval of positions related to the suffixes of $X^{p-1}$ sharing the first $p-1$ characters, and (among them) let us consider the $r$-th suffix having a given $c$ at position $p$. Then, such suffix will be at position $j=b+s+r$ of $X^{p}$ ($j\in[b,e]$), where $s$ is the number of suffixes in the interval $[b,e]$ having a symbol smaller than $c$ at position $p$. Therefore, $I_{X^{p}}[j]$ will be equal to $l$.

The first strategy (backward) is based on Observation 1 and consists in scanning the encoding $I_{X^{p-1}}$. $|\Sigma|$ empty buckets are initialized (one for each alphabet symbol), and for each length $l=I_{X^{p-1}}[i]$, the symbol $c$ preceding the related suffix is obtained from the partial BWTs ($c$ is indeed the $i$-th symbol of the interleave of the partial BWTs encoded by $I_{X^{p-1}}$, and will be the $t$-th of vector $B_{l}$ if suffix $X^{p-1}[i]$ is the $t$-th suffix in $X^{p-1}$ having length $l$). At this point, the length $l+1$ is added to the bucket related to symbol $c$ if $c$ is not the sentinel $\$$, otherwise the value $0$ is added (to the $\$$ bucket). At the end of the iteration, the concatenation of the buckets following the lexicographical order of the symbols provides the encoding $I_{X^{p}}$.
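The backward step can be sketched as a single scan with one bucket per symbol. The code below is an in-memory illustration under our own assumptions: partial BWTs are Python lists, buckets are kept in a dictionary, the symbols are ordered by their character codes (so `$` is the smallest), and the LCP computation that accompanies the merge is omitted. The function names are hypothetical.

```python
def trivial_encoding(B):
    """I_{X^0}: |B[0]| zeros, then |B[1]| ones, ..., then |B[k]| values k."""
    return [l for l, B_l in enumerate(B) for _ in B_l]


def backward_step(B, enc):
    """Compute I_{X^p} from I_{X^{p-1}} following Observation 1."""
    pos = [0] * len(B)   # pos[l]: elements of B[l] consumed so far
    buckets = {}         # symbol -> list of lengths, in scan order
    for l in enc:
        c = B[l][pos[l]]  # symbol preceding the current suffix
        pos[l] += 1
        # a sentinel means the suffix was a complete string: emit a 0-suffix
        buckets.setdefault(c, []).append(0 if c == "$" else l + 1)
    # concatenate the buckets in lexicographic order of their symbols
    return [l for c in sorted(buckets) for l in buckets[c]]


def backward_merge(B):
    """Run k backward iterations, k = len(B) - 1, starting from I_{X^0}."""
    enc = trivial_encoding(B)
    for _ in range(len(B) - 1):
        enc = backward_step(B, enc)
    return enc
```

With `B = [["A", "C"], ["C", "A"], ["$", "$"]]` (the partial BWTs of the reads `CA` and `AC`) the first iteration already yields `0 0 1 2 1 2`, and the second iteration leaves it unchanged.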

The second strategy (forward) is based on Observation 2 and maintains a partitioning of the generic encoding $I_{X^{p}}$ into contiguous segments, which are called $p$ -segments. A $p$ -segment is an interval of positions which are related to suffixes sharing the first $p$ characters. The forward strategy consists in scanning the $p-1$ -segments of the encoding $I_{X^{p-1}}$ one after the other and uses $|\Sigma|$ initially empty buckets. For each $p-1$ -segment $[b,e]$ , its $(e-b+1)$ suffixes are considered, and each suffix length $l$ is added to a bucket depending on the symbol at position $p$ in the suffix. At the end of the iterations over the $p-1$ -segment, the concatenation of the buckets following the lexicographical order of the symbols, provides the encoding $I_{X^{p}}$ between positions $b$ and $e$ .
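For comparison, a forward iteration can be sketched in the same style. This is a simplified in-memory illustration and not the procedure of [41, 42]: it assumes direct access to the suffix texts (lists `X[l]` of sorted $l$-suffixes, sentinel included), represents segments as half-open index pairs, and uses hypothetical function names.

```python
def forward_step(X, enc, segments, p):
    """Refine each (p-1)-segment by the symbol at position p (1-based)."""
    pos = [0] * len(X)
    new_enc, new_segments = [], []
    for b, e in segments:
        buckets = {}
        for l in enc[b:e]:
            suffix = X[l][pos[l]]
            pos[l] += 1
            # suffixes already exhausted (shorter than p) keep their place
            c = suffix[p - 1] if p <= len(suffix) else ""
            buckets.setdefault(c, []).append(l)
        for c in sorted(buckets):
            start = len(new_enc)
            new_enc.extend(buckets[c])
            new_segments.append((start, len(new_enc)))  # a new p-segment
    return new_enc, new_segments


def forward_merge(X):
    """Run k forward iterations, k = len(X) - 1, from the trivial encoding."""
    enc = [l for l, X_l in enumerate(X) for _ in X_l]
    segments = [(0, len(enc))]
    for p in range(1, len(X)):
        enc, segments = forward_step(X, enc, segments, p)
    return enc
```

On the toy inputs used for the backward sketch this produces the same encodings, which is consistent with both strategies computing $I_{X}$.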

Both algorithms compute the interleave $I_{X}$ giving the Suffix Array as defined in Section 2 whose uniqueness is guaranteed by the radix sort strategy.

Figure 4 shows an example for the two strategies applied to a set of three reads.

In Section 5 the backward strategy will be detailed alongside the computation of the LCP array. We refer to [41, 42] for the details about the forward strategy.

4 Computing the partial BWTs

The first phase of the method computes the partial BWTs $B_{0},\ldots,B_{k}$ by first preprocessing the input strings $s_{1},\ldots,s_{m}$ in order to obtain $k+1$ arrays $T_{0},\ldots,T_{k}$ with length $m$, where $T_{l}$ lists the characters such that $T_{l}[i]=s_{i}[|s_{i}|-l]$ when $0\leq l\leq|s_{i}|-1$, $T_{l}[i]=\$$ when $l=|s_{i}|$, and $T_{l}[i]=\#$ when $l>|s_{i}|$ (where $\#$ is a padding symbol not belonging to the alphabet of the input strings). Section (a) of Figure 5 reports an example of arrays $T_{l}$ for the three strings of Figure 2. Observe that $T_{0}$ lists the last characters $\langle s_{1}[k_{1}],s_{2}[k_{2}],\ldots,s_{m}[k_{m}]\rangle$ of the input strings in the same order the strings have in the set $S$, and $T_{0}$ is clearly equal to $B_{0}$. Observe that $T_{l}[i]$ (when different from $\#$) is the symbol preceding the $l$-suffix of string $s_{i}$.

The preprocessing step is a trivial task that iterates over the input strings and outputs the $k+1$ arrays $T_{0},\ldots,T_{k}$. We can summarize this procedure as a loop that iterates over the input strings and performs the following steps. Let $s_{i}$ be the input string and suppose that we already preprocessed the previous $i-1$ sequences. We first reverse $s_{i}$ and produce the string $r_{i}$. Then, for each position $l$ of $r_{i}$, we write the $l$-th character of $r_{i}$ at position $i$ of array $T_{l}$, padding this array with $\#$ if it includes fewer than $i-1$ elements. Finally, we write $\$$ at position $i$ of array $T_{|s_{i}|}$ (padding it with $\#$ if required) and move to the next sequence.
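A minimal sketch of this preprocessing, assuming the reads fit in a Python list and are given without the sentinel (the function name `build_T` is hypothetical). For simplicity it computes $k$ up front and pads eagerly, instead of padding lazily as described above; the resulting arrays are the same.

```python
def build_T(reads):
    """Return [T_0, ..., T_k], each a list of m symbols.

    T[l][i] is the symbol preceding the l-suffix of read i, '$' when the
    l-suffix is the whole read, and '#' when the read is shorter than l.
    """
    k = max(len(s) for s in reads)
    T = [[] for _ in range(k + 1)]
    for s in reads:
        r = s[::-1]  # reversed read: r[l] is the symbol at distance l from the end
        for l in range(k + 1):
            if l < len(r):
                T[l].append(r[l])
            elif l == len(r):
                T[l].append("$")
            else:
                T[l].append("#")  # padding symbol, outside the alphabet
    return T
```

For the reads `AC` and `C` this gives `T_0 = C C`, `T_1 = A $`, `T_2 = $ #`.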

The partial BWTs $B_{0},\ldots,B_{k}$ are computed by Algorithm 2, which receives in input the arrays $T_{0},\ldots,T_{k}$ and performs $k+1$ iterations. Iteration $l$ ($0\leq l\leq k$) computes $B_{l}$ from $X_{l}$, which is implicitly known, and implicitly determines $X_{l+1}$ to be used in the next iteration. In more detail, the ordering of $X_{l}$ is known by means of an array $N_{l}$, with length at most $m$, such that $N_{l}[i]=q$ if and only if the $i$-th element of $X_{l}$ is the $l$-suffix of the input string $s_{q}$. In other words, position $i$ of array $N_{l}$ gives the index $q$ of the string whose $l$-suffix is the $i$-th in $X_{l}$. The partial BWT $B_{l}$ can be computed (see cycle at line 2) by scanning $N_{l}$, since $B_{l}[i]$ is (by definition) the symbol preceding the $l$-suffix of the string $s_{q}$, where the index $q$ is equal to $N_{l}[i]$, and can be obtained by accessing array $T_{l}$. Indeed, $B_{l}[i]=T_{l}[q]$. Observe that $B_{l}$ is treated by Algorithm 2 as an initially empty list, and the symbol $c$ is appended to $B_{l}$ only if it is not the padding symbol # (signaling that the originating string is shorter than $k$). Note that, at the first iteration $l=0$, $N_{0}$ (which is set in cycle at line 2) is the sequence of indices $\langle 1,2,\ldots,m\rangle$, and $B_{0}$ is correctly computed as the sequence of the last characters $\langle s_{1}[k_{1}],\ldots,s_{m}[k_{m}]\rangle$ (i.e., $T_{0}$).

At the same time, iteration $l$ computes the array $N_{l+1}$ to be used in the next iteration $l+1$ in order to compute the partial BWT $B_{l+1}$. Observe in fact that the $i$-th $l$-suffix of $X_{l}$ is preceded by the symbol $c=T_{l}[q]$, where $q=N_{l}[i]$, and belongs to string $s_{q}$. Assuming that the $i$-th suffix of $X_{l}$ is the $h$-th suffix of $X_{l}$ which is preceded by that symbol $c$, then the $(l+1)$-suffix of $s_{q}$ is the $h$-th suffix of $X_{l+1}$ starting with $c$. Furthermore, let us assume that there are $r$ $l$-suffixes of $X_{l}$ preceded by a symbol smaller than $c$. Then the $(l+1)$-suffix of $s_{q}$ is the $(r+h)$-th suffix of $X_{l+1}$. By definition, it holds that $N_{l+1}[r+h]=q$. The algorithm uses $\sigma+1$ lists $\mathcal{P}(\cdot)$, one for each symbol in $\Sigma$, which are created at the beginning of iteration $l$. During the scan of $N_{l}$ the index $q$ is appended to the list $\mathcal{P}(c)$.

It is easy to prove that, at the end of iteration $l$, the concatenation of the lists $\mathcal{P}(\cdot)$ (according to the order of the symbols in $\Sigma$) correctly gives $N_{l+1}$. Note that $N_{l+1}$ is computed also by the last iteration $k$, even though it is actually not used. Figure 5 exemplifies the iteration $l=1$ of Algorithm 2 which computes, for the set of reads of Figure 2, the partial BWT $B_{1}$ (see the for cycle at line 2) and the array $N_{2}$ (line 2), from the array $N_{1}$ (computed by the previous iteration $l=0$). Array $N_{2}$ will be used by the next iteration $l=2$ for computing the partial BWT $B_{2}$.
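The iteration described above can be sketched as follows. This is an illustrative in-memory version under stated assumptions: the name `partial_bwts` is hypothetical, the alphabet is passed as a string of sorted symbols with `$` first, indices are 0-based, and plain Python lists stand in for the files that would hold $B_{l}$ and $\mathcal{P}(\cdot)$ on disk.

```python
def partial_bwts(T, alphabet):
    """Compute B_0, ..., B_k from T_0, ..., T_k by bucketing string indices."""
    m = len(T[0])
    N = list(range(m))                    # N_0 = <0, 1, ..., m-1>
    B = []
    for T_l in T:
        P = {c: [] for c in alphabet}     # one list P(c) per symbol
        B_l = []
        for q in N:
            c = T_l[q]                    # symbol preceding the l-suffix of s_q
            if c != "#":                  # '#' means s_q has no l-suffix
                B_l.append(c)
                P[c].append(q)
        B.append(B_l)
        # Concatenating the buckets in alphabet order gives N_{l+1}.
        # Indices in P('$') belong to exhausted strings; they are skipped
        # at the next iteration because T_{l+1} holds '#' for them.
        N = [q for c in alphabet for q in P[c]]
    return B
```

For the strings `TC` and `GA` the arrays are $T_{0}=\langle C,A\rangle$, $T_{1}=\langle T,G\rangle$, $T_{2}=\langle\$,\$\rangle$, and the sketch returns $B_{1}=\langle G,T\rangle$: the 1-suffix `A` of the second string sorts before the 1-suffix `C` of the first one.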

Observe that arrays $T_{i}$ must be kept in main memory, since they are not accessed sequentially (see Algorithm 5), and for this reason they cannot be stored in external memory.

5 Backward strategy for computing the encoding $I_{B}$ and the LCP array

This section is devoted to describing the second step of our algorithm, which computes the BWT $B$ and the LCP array according to the backward strategy described in Section 3.

First of all, we describe in detail how a single iteration works (see Algorithm 3). Then, we show how to enrich Algorithm 3 in order to compute also the LCP array together with the encoding $I_{X}$ (see Algorithm 4). Finally, we present the complete procedure for computing $I_{X}=I_{B}$ from the partial BWTs $B_{l}$ (see Algorithm 5), and we explain how to use the LCP array values in order to limit the iterations to the number strictly necessary to obtain $I_{X}$.

At this point, let us assume that (at iteration $p$) we have the encoding $I_{X^{p-1}}$ of the $(p-1)$-interleave $X^{p-1}$. We want to compute the encoding $I_{X^{p}}$ of the $p$-interleave $X^{p}$, by sorting the suffixes of $X^{p-1}$ by the first $p$ characters. The algorithm implicitly obtains $X^{p}$ (suffixes sorted by the first $p$ characters) by implicitly reordering the characters preceding each one of the suffixes of $X^{p-1}$ (suffixes sorted by the first $p-1$ characters). We note that (by definition) for any $p$ from $0$ to $k$ the first $m$ entries of $I_{X^{p}}$ are all equal to $0$. Indeed, the $m$ $0$-suffixes (of the set $S$) always occupy the first $m$ positions for any value of $p$.

Before entering the details of iteration $p$ (see Algorithm 3), we give the idea of the algorithm. Let us consider the suffix $X^{p-1}[q]$ whose length is $l=I_{X^{p-1}}[q]$. Let $c$ be the symbol preceding such suffix. Let $\mathcal{X}^{p-1}_{s}$ be the subset of suffixes of $X^{p-1}$ preceded by a symbol smaller than $c$, and let $\mathcal{X}^{p-1}_{e}$ be the subset of suffixes at a position $t<q$ of $X^{p-1}$ preceded by the symbol $c$. It is easy to show that the suffix $cX^{p-1}[q]$ (with length $l+1$) is greater (by the first $p$ characters) than all and only those suffixes $c_{p}x$ such that $x\in\mathcal{X}^{p-1}_{s}\cup\mathcal{X}^{p-1}_{e}$ and $c_{p}$ is the symbol preceding $x$. Therefore, the suffix $cX^{p-1}[q]$ (with length $l+1$) will occupy in $X^{p}$ the position $q^{\prime}=|\mathcal{X}^{p-1}_{s}\cup\mathcal{X}^{p-1}_{e}|+1$, and $I_{X^{p}}[q^{\prime}]$ will be equal to $l+1$. In other words, the position $q^{\prime}$ of suffix $cX^{p-1}[q]$ in $X^{p}$ is given by the sum $h_{s}+h_{e}+1$, where $h_{s}$ is the number of suffixes of $X^{p-1}$ preceded by a symbol smaller than $c$ and $h_{e}$ is the number of suffixes which are before $X^{p-1}[q]$ and preceded by symbol $c$.
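The position formula $h_{s}+h_{e}+1$ can be checked directly with a small Python sketch. It is deliberately naive (quadratic time) and only meant to make the counting explicit; the name `new_positions` and the input `preceding`, the list of symbols preceding the suffixes of $X^{p-1}$ in order, are assumptions made for this illustration.

```python
def new_positions(preceding, alphabet):
    """For each suffix of X^{p-1}, return the 1-based position in X^p of the
    suffix extended by its preceding symbol (None when that symbol is '$')."""
    rank = {c: i for i, c in enumerate(alphabet)}   # '$' must come first
    result = []
    for q, c in enumerate(preceding):
        if c == "$":
            result.append(None)        # whole string: nothing to extend
            continue
        # h_s: suffixes preceded by a symbol smaller than c
        h_s = sum(1 for d in preceding if rank[d] < rank[c])
        # h_e: earlier suffixes preceded by the same symbol c
        h_e = sum(1 for d in preceding[:q] if d == c)
        result.append(h_s + h_e + 1)
    return result
```

Note that the suffixes preceded by `$` are counted in $h_{s}$: there are exactly $m$ of them, matching the $m$ $0$-suffixes that occupy the first $m$ positions of $X^{p}$.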

Algorithm 3 creates a set of $\sigma+1$ lists $\mathcal{I}(c_{0}),\mathcal{I}(c_{1}),\ldots,\mathcal{I}(c_{\sigma})$ containing, at the end of iteration $p$, the partitioning of the encoding $I_{X^{p}}$ by the first character $c_{i}$ of the suffixes of $X^{p}$. Since the list $\mathcal{I}(c_{0})$ (we recall that $c_{0}=\$$) is related to the $0$-suffixes, it is fixed over the iterations $p$ and is always composed of $m$ $0$s (and it is initialized at the beginning of the procedure). Therefore, at the end, the algorithm produces $I_{X^{p}}$ (see Algorithm 3) as the concatenation $\mathcal{I}(c_{0})\mathcal{I}(c_{1})\cdots\mathcal{I}(c_{\sigma})$. In order to fill the lists $\mathcal{I}(\cdot)$, Algorithm 3 performs a scan of $I_{X^{p-1}}$. For each position $q$, it obtains $l=I_{X^{p-1}}[q]$, that is the length of the $q$-th suffix of $X^{p-1}$, and the symbol $c$ preceding such suffix (see Algorithm 3). Vector $pos$ allows reading $c$ from the correct position of array $B_{l}$. If $c\neq\$$, then $l$ is not greater than the length of the input string originating the suffix $X^{p-1}[q]$, and the integer $l+1$ is appended to the list $\mathcal{I}(c)$. Otherwise, if $c=c_{0}=\$$, it moves to the next position $q+1$. Indeed in this case, the value $l+1$ is greater than the length of the input string originating the suffix $X^{p-1}[q]$, thus the $cX^{p-1}[q]$ obtained is a $0$-suffix whose related integer $0$ should be appended to the list $\mathcal{I}(c_{0})$, which is fixed (by definition) over the iterations.
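A single iteration of this kind can be sketched in Python as below. The sketch keeps everything in memory, whereas the method streams the lists from and to disk; the name `interleave_step`, the representation of the partial BWTs as a list of lists `B`, and the alphabet passed as a sorted string starting with `$` are assumptions of this illustration.

```python
def interleave_step(I_prev, B, alphabet):
    """Compute the encoding I_{X^p} from I_{X^{p-1}} and the partial BWTs B."""
    m = len(B[0])                          # number of input strings
    lists = {c: [] for c in alphabet}
    lists["$"] = [0] * m                   # the m 0-suffixes, fixed
    pos = [0] * len(B)                     # next unread position of each B_l
    for l in I_prev:
        c = B[l][pos[l]]                   # symbol preceding the current suffix
        pos[l] += 1
        if c != "$":
            lists[c].append(l + 1)         # extended suffix has length l + 1
    # Concatenate the lists following the order of the alphabet.
    return [x for c in alphabet for x in lists[c]]
```

For the strings `AC` and `CA`, with $B_{0}=\langle C,A\rangle$, $B_{1}=\langle C,A\rangle$, $B_{2}=\langle\$,\$\rangle$, one step applied to $\langle 0,0,1,1,2,2\rangle$ gives $\langle 0,0,1,2,1,2\rangle$, which is the order of the suffixes `$`, `$`, `A$`, `AC$`, `C$`, `CA$`.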

This approach is alternative to the one presented in [26] first and then implemented in [41]. In fact, the iteration $p$ is a backward extension of the suffixes sorted by the first $p-1$ characters in order to obtain the suffixes sorted by the first $p$ characters. Instead the strategy presented in [26] is based on a forward extension of the $(p-1)$ -prefixes of the suffixes in order to obtain the ordering given by the encoding $I_{X^{p}}$ .

The following theorem proves the correctness of Algorithm 3.

Theorem 1.

If Algorithm 3 receives in input the encoding $I_{X^{p-1}}$ of the $p-1$ -interleave $X^{p-1}$ , then it computes the encoding $I_{X^{p}}$ of the $p$ -interleave $X^{p}$ .

Proof.

Observe that the $(p-1)$-prefix (prefix with length $p-1$) of the $i$-th suffix of $X_{l}$ is the suffix of the $p$-prefix of a suffix of $X_{l+1}$, starting with the symbol $c=B_{l}[i]$. Then, line 7 of Algorithm 3 appends length $l+1$ to the list $\mathcal{I}(c)$. Observe that line 3 implicitly computes a partitioning of the suffixes in $X^{p}$, according to their starting symbol, into lists $\mathcal{I}(c_{0}),\mathcal{I}(c_{1}),\ldots,\mathcal{I}(c_{\sigma})$, where $\mathcal{I}(c_{i})$ gives the ordering, by the first $p$ characters, of the suffixes starting with symbol $c_{i}$. Each list $\mathcal{I}(c_{i})$ contains (at line 3) the lengths of such suffixes.

Furthermore, given two distinct suffixes $c_{1}x_{1}$ and $c_{2}x_{2}$ such that $c_{1}x_{1}$ is smaller (by the first $p-1$ characters) than $c_{2}x_{2}$ , either they begin with two different symbols $c_{1}<c_{2}$ , or they both start with the same symbol, i.e., $c_{1}=c_{2}$ . Let $L(c_{1})$ and $L(c_{2})$ be the partitions of $X^{p-1}$ containing the suffixes starting with $c_{1}$ and $c_{2}$ (respectively). Then, in $X^{p}$ all suffixes in $L(c_{1})$ precede those in $L(c_{2})$ . Inside the list $L(c_{1})$ , the ordering of two suffixes $c_{1}x_{i}$ and $c_{1}x_{j}$ by the first $p$ characters is the same as in $X^{p-1}$ . Indeed, $cx_{i}[:p-1]$ is lexicographically smaller than $cx_{j}[:p-1]$ if and only if $x_{i}[:p-1]$ is lexicographically smaller than $x_{j}[:p-1]$ . It follows that $X^{p}$ consists of the concatenation of $L(c_{i})$ according to the lexicographic ordering of symbols of alphabet $\Sigma$ , and thus line 3 of Algorithm 3 computes the encoding $I_{X^{p}}$ of $X^{p}$ . ∎

In the following we will describe how to compute the LCP array of the input dataset. Similarly to the computation of the BWT $B$ , the LCP array will be constructed iteratively. More precisely, the LCP array will be constructed by considering prefixes of the suffixes by increasing length. At this point, we can describe how to update Algorithm 3 (iteration $p$ ) in order to compute (at the end of the iterations) also the LCP array.

To this aim we must introduce the following definition.

Definition 2.

Given the LCP array, $\mathit{LCP}_{p}$ is defined such that $\mathit{LCP}_{p}[i]=\min\{LCP[i],p\}$ .

Observe that $\mathit{LCP}_{p}[i]$ is the length of the longest prefix shared by the $p$-prefix of $X^{p}[i]$ and the $p$-prefix of $X^{p}[i-1]$. We note that, when a suffix in $X^{p}$ is shorter than $p$, its $p$-prefix (considered for $\mathit{LCP}_{p}$) is the whole suffix itself ($\$$ excluded).

The array $\mathit{LCP}_{k}$ is equal to the LCP array of the input set $S$, and $\mathit{LCP}_{0}$ contains all $0$s, except for $\mathit{LCP}_{0}[1]$ that is equal to $-1$. In Figure 7 $\mathit{LCP}_{0}$ and $\mathit{LCP}_{1}$ are reported for the input set of Figure 2.
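Definition 2 amounts to truncating every LCP value at $p$. A short Python illustration, where the name `truncate_lcp` and the example LCP array (the one of the two strings `AA` and `AA`, computed by hand) are assumptions made here:

```python
def truncate_lcp(lcp, p):
    """Return LCP_p, where LCP_p[i] = min(LCP[i], p)."""
    return [min(v, p) for v in lcp]

# Sorted suffixes of {AA, AA}: $, $, A$, A$, AA$, AA$
lcp = [-1, 0, 0, 1, 1, 2]
lcp_0 = truncate_lcp(lcp, 0)   # all 0s except the leading -1
lcp_1 = truncate_lcp(lcp, 1)
```

The leading $-1$ is unaffected by the truncation, since $\min\{-1,p\}=-1$ for every $p\geq 0$.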

The LCP array is computed iteratively by starting from $\mathit{LCP}_{0}$ . Now we describe the single iteration $p$ for computing $\mathit{LCP}_{p}$ from $\mathit{LCP}_{p-1}$ . Algorithm 4 extends Algorithm 3 in order to compute $I_{X^{p}}$ and $\mathit{LCP}_{p}$ from $I_{X^{p-1}}$ and $\mathit{LCP}_{p-1}$ .

Algorithm 4 builds a set of $\sigma+1$ lists $\mathcal{L}(c_{0}),\mathcal{L}(c_{1}),\ldots,\mathcal{L}(c_{\sigma})$ containing the partitioning of the elements of $\mathit{LCP}_{p}$ by the first character $c_{i}$ ($0\leq i\leq\sigma$) of the related suffix. Since the list $\mathcal{L}(c_{0})$ (with $c_{0}=\$$) is related to the $0$-suffixes, it is fixed for any iteration and is composed of $-1$ followed by $m-1$ $0$s. Moreover, observe that the first element of each list $\mathcal{L}(c_{i})$ ($1\leq i\leq\sigma$) is always $0$. Finally, Algorithm 4 concatenates all the lists $\mathcal{L}(c_{0}),\mathcal{L}(c_{1}),\ldots,\mathcal{L}(c_{\sigma})$, thus producing $\mathit{LCP}_{p}$ (see line 4).

Before giving the details of computing the single lists $\mathcal{L}(\cdot)$, we need to introduce the following function. Given a position $q$ and a symbol $c\neq\$$, the function $\alpha_{p}(q,c)$ is the length of the longest prefix shared by the $p$-prefixes of suffixes $X^{p}[q]$ and $X^{p}[h]$, where $h$ is the biggest position before $q$ related to a suffix $X^{p}[h]$ preceded by symbol $c$. If such $h$ does not exist, then $\alpha_{p}(q,c)=-1$.

In the following, given two strings $x_{1},x_{2}$ , we denote (respectively) by $lcp_{p}(x_{1},x_{2})$ and $lcp(x_{1},x_{2})$ the length of the longest common prefix between the $p$ -prefixes of $x_{1}$ and $x_{2}$ , and the length of the longest common prefix between $x_{1}$ and $x_{2}$ (that is, $lcp(x_{1},x_{2})=lcp_{k}(x_{1},x_{2})$ ). The following proposition relates the values of $\alpha_{p-1}(q,c)$ and $\mathit{LCP}_{p}$ and it is a direct consequence of their definitions.

Proposition 1.

Let $cx_{1}$ and $cx_{2}$ be two consecutive suffixes of $X^{p}$, and let $x_{2}$ be the $q$-th suffix of $X^{p-1}$. Then $\min\{p,lcp(cx_{1},cx_{2})\}=1+\alpha_{p-1}(q,c)$.

During the scan of the encoding $I_{X^{p-1}}$, the value $\mathit{LCP}_{p-1}[i]$ is obtained (see line 13 of Algorithm 4). The function $\alpha_{p-1}(q,c)$ is maintained in the array $\alpha$ of size $\sigma-1$, whose entries are all initially set to $-1$, and updated in the cycle at line 4. The main invariant of Algorithm 4 is that, at line 4, the variable $\alpha[c]$ is equal to $\alpha_{p-1}(q,c)$; this invariant is a consequence of the following Lemma 1 and can be proved by a direct inspection of Algorithm 4. The value $\alpha[c]$ incremented by $1$ is appended to the list $\mathcal{L}(c)$.

Lemma 1.

Let $x_{1}$ and $x_{2}$ be respectively the $j$-th and the $q$-th suffixes of $X^{p-1}$, such that $j<q$, and let $c$ be the symbol preceding suffix $x_{1}$. If every suffix at a position $t$ between $j$ and $q$ ($j<t<q$) is not preceded by the symbol $c$, then it holds that $\alpha_{p-1}(q,c)=\min_{j<h\leq q}\{\mathit{LCP}_{p-1}[h]\}$.

Proof.

Since $c$ is not the symbol that precedes the suffix at any position $t$ with $j<t<q$, by definition of $\alpha_{p-1}(q,c)$ it must be that $\alpha_{p-1}(q,c)=lcp_{p-1}(X^{p-1}[j],X^{p-1}[q])$, since $j$ is the largest integer less than $q$ for which the $j$-th suffix is preceded by symbol $c$. Since it is immediate to verify that $lcp_{p-1}(X^{p-1}[j],X^{p-1}[q])=\min_{j<h\leq q}\{\mathit{LCP}_{p-1}[h]\}$, the lemma easily follows. ∎

The previous argument allows us to prove the following theorem which, combined with Theorem 1, completes the proof of correctness of Algorithm 4.

Theorem 2.

Given as input $\mathit{LCP}_{p-1}$ and the partial BWTs $B_{0},B_{1},\ldots,B_{k}$ , Algorithm 4 computes $\mathit{LCP}_{p}$ .

Proof.

Observe that $\alpha[c]\geq 0$ at line 4 iff the current suffix at position $q$ is not the first to be preceded by the character $c$ , hence we must append the value $1+\alpha_{p-1}(q,c)$ to $\mathcal{L}(c)$ . Since $\alpha[c]=\alpha_{p-1}(q,c)$ , the theorem is proved. ∎
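The maintenance of the array $\alpha$ can be sketched in isolation. The Python function below is an illustration under stated assumptions, not a transcription of Algorithm 4: the name `next_lcp` is hypothetical, the input `preceding` lists the symbol preceding each suffix of $X^{p-1}$ (in the method this symbol is read from the partial BWTs), and $\alpha[c]$ is reset to infinity after each use so that the running minimum of Lemma 1 restarts from the next position.

```python
import math

def next_lcp(preceding, lcp_prev, alphabet):
    """Compute LCP_p from LCP_{p-1} and the symbols preceding the suffixes."""
    m = preceding.count("$")               # one whole string per input string
    L = {c: [] for c in alphabet}
    L["$"] = [-1] + [0] * (m - 1)          # fixed list of the 0-suffixes
    alpha = {c: -1 for c in alphabet if c != "$"}
    for c, lcp in zip(preceding, lcp_prev):
        # O(sigma) update: running minimum of LCP_{p-1} since the last
        # suffix preceded by each symbol.
        for a in alpha:
            alpha[a] = min(alpha[a], lcp)
        if c != "$":
            L[c].append(alpha[c] + 1)      # Proposition 1
            alpha[c] = math.inf            # restart the minimum for c
    return [x for c in alphabet for x in L[c]]
```

With the strings `AA` and `AA`, the suffixes of $X^{1}$ are preceded by $\langle A,A,A,A,\$,\$\rangle$ and $\mathit{LCP}_{1}=\langle -1,0,0,1,1,1\rangle$; the sketch returns $\mathit{LCP}_{2}=\langle -1,0,0,1,1,2\rangle$.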

In Figure 6 the computation of $I_{X^{1}}$ from $I_{X^{0}}$ (by Algorithm 3 for $p=1$ ) is shown for the set $S$ of reads presented in Figure 2. The encodings $I_{X^{1}}$ and $I_{X^{0}}$ are reported in Figure 7 together with $LCP_{1}$ and $LCP_{0}$ whose computation (by Algorithm 4) has been omitted for simplicity.

The procedure BWT+LCP (see Algorithm 5) computes $I_{X^{k}}$ and $\mathit{LCP}_{k}$ , which are the encoding of the BWT and the LCP array of the input set $S$ of strings, by iterating Algorithm 4. Iterations stop when the maximum value $\max_{q}\{\mathit{LCP}_{p}[q]\}$ in the array $\mathit{LCP}_{p}$ is less than $p$ . In fact, it means that for an iteration $t>p$ , the values $I_{X^{t}}$ and $\mathit{LCP}_{t}$ do not change since the suffixes have been fully sorted and thus $I_{X^{t}}$ and $\mathit{LCP}_{t}$ remain equal to $I_{X^{k}}$ and $\mathit{LCP}_{k}$ , respectively. The correctness of the procedure BWT+LCP is a consequence of Theorem 2 and Definition 2. Observe that if the maximum value in the LCP array is equal to $z$ , then at each iteration $p$ of Algorithm 5 with $p\leq z$ , the maximum value in $\mathit{LCP}_{p}$ is $p$ , in virtue of Theorem 2 and Definition 2. When $p=z+1$ , then by Definition 2, the iteration $p$ gives value $z$ , that is $\max_{q}\{\mathit{LCP}_{p}[q]\}<p$ . Then the suffixes have been fully sorted and the LCP array has been computed at the previous step $p=z$ .
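The whole procedure, including the stopping criterion, can be sketched as a single in-memory Python function. Several points are assumptions of this sketch rather than details taken from Algorithm 5: the name `bwt_lcp`, the initial encoding $I_{X^{0}}$ listing the lengths in nondecreasing order, the initial array with $-1$ followed by $0$s, and the use of Python lists in place of files.

```python
import math

def bwt_lcp(B, alphabet):
    """Return (I_X, LCP, iterations) from the partial BWTs B_0, ..., B_k."""
    m = len(B[0])
    # Assumed initial state: all 0-suffixes, then all 1-suffixes, and so on.
    I = [l for l, B_l in enumerate(B) for _ in B_l]
    LCP = [-1] + [0] * (len(I) - 1)
    p = 0
    while True:
        p += 1
        I_lists = {c: [] for c in alphabet}
        L_lists = {c: [] for c in alphabet}
        I_lists["$"] = [0] * m
        L_lists["$"] = [-1] + [0] * (m - 1)
        alpha = {c: -1 for c in alphabet if c != "$"}
        pos = [0] * len(B)
        for l, lcp in zip(I, LCP):
            for a in alpha:                    # running minima of LCP_{p-1}
                alpha[a] = min(alpha[a], lcp)
            c = B[l][pos[l]]                   # symbol preceding the suffix
            pos[l] += 1
            if c != "$":
                I_lists[c].append(l + 1)
                L_lists[c].append(alpha[c] + 1)
                alpha[c] = math.inf
        I = [x for c in alphabet for x in I_lists[c]]
        LCP = [x for c in alphabet for x in L_lists[c]]
        if max(LCP) < p:                       # suffixes are fully sorted
            return I, LCP, p
```

On the strings `AA` and `AA` the largest LCP value is $2$ and the loop performs $3$ iterations, in agreement with the argument above: the last iteration only certifies that nothing changes anymore.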

Observe that, in virtue of the radix sort strategy, the two steps of our method (computing the partial BWTs and computing the interleave $I_{X}$ ) do not depend on the particular order of the strings in the input set $S$ . For this reason, there is no particular order of the input strings which may improve the computation.

5.1 Comparison with other strategies

While a common element of our method with egap and BEETL is the use of a radix sort strategy, a main difference is represented by the collection of objects to which it is applied. BEETL’s algorithm works by a unique step and is based on the following invariant: at the iteration $p$ , it computes the partial BWT for the collection of suffixes of length at most $p$ . Differently, our algorithm works by two steps: first it computes the partial BWTs $B_{l}$ (as previously defined) and then the interleave $I_{X}$ . In the second step the following invariant is maintained: at the iteration $p$ , it computes the list of the symbols preceding all the suffixes in the input collection $S$ sorted by the $p$ -long prefixes. As a consequence, at the iteration $p$ , it computes a permutation of the BWT for $S$ tending to the solution over the iterations, while BEETL computes a subsequence of the BWT for $S$ and maintains over the iterations the reciprocal order between the symbols.

Arrays $N_{l}$ used by the first step of our algorithm (computing the partial BWTs $B_{l}$ ) are the same as arrays $N_{l}$ used in [16]. Indeed, $N_{l}[i]$ in our case is the position in $S$ of the string which is the origin of the $i$ -th $l$ -suffix $X_{l}[i]$ whose preceding symbol is $B_{l}[i]$ . Arrays $N_{l}(h)$ in [16] are defined such that $N_{l}(h)[i]$ is the position in $S$ , of the string which is the origin of the $i$ -th $l$ -suffix (in the partial BWT) starting with the $h$ -th symbol $c_{h}$ of the alphabet. We note that the concatenation $N_{l}(0)N_{l}(1)...N_{l}(\sigma)$ gives the array $N_{l}$ of our algorithm.

Observe that both our algorithm and egap use the notion of an interleave in order to compute the BWT and the LCP array. More precisely, egap splits the input collection $S$ into subcollections sufficiently small, then it computes the BWT (partial BWTs) for each subcollection and finally it merges the BWTs similarly to the approach in [27]. On the other hand, our algorithm first computes the partial BWTs $B_{l}$ from the whole collection $S$ , that are then merged maintaining the invariant property described above.

6 Complexity

In this section we analyze the time complexity and the I/O volume of our algorithm.

First we will analyze Algorithm 2. This procedure mainly consists of two nested loops in which each operation requires constant time. If the input is a set of $m$ strings of length $k$, its time complexity is $\mathcal{O}(mk)$. Note that each of the $k+1$ lists $B_{l}$ and $N_{l}$ has $m$ elements which are read or written sequentially and, moreover, each list is read only once per execution. Hence, the I/O volume of Algorithm 2 is $\mathcal{O}(mk\lg m)$ since, for each element in $T_{0},\ldots,T_{k}$, Algorithm 2 appends an integer less than $m$ to the correct list $\mathcal{P}(\cdot)$, which we can store on disk since we access it sequentially.

Besides some $\mathcal{O}(1)$-space data structures, the algorithm uses $\sigma+1$ lists $\mathcal{P}(\cdot)$ to store pointers to the open files and $k+1$ arrays $T_{0},T_{1},\ldots,T_{k}$ to store the characters of the sequences. Note that, at each iteration of the loop at line 2, only one array $T_{l}$ must be kept in main memory, since we need to perform non-sequential accesses, and it requires $m\lg\sigma$ bits. Notice that for one billion DNA reads, at 2 bits per character, that translates to about 256 Mbytes of memory, which is well below the RAM amount found in standard PCs. Therefore, if we can address each file using $w$ bits, the main memory requirement of Algorithm 2 is $\mathcal{O}(\sigma w+kw+m\lg\sigma)$ bits.

Furthermore, arrays $T_{l}$ ($0\leq l\leq k$) can be computed in $\mathcal{O}(km)$ time and $\mathcal{O}(km\lg\sigma)$ I/O volume by reading sequentially $\mathcal{B}$ input strings at a time and producing $\mathcal{B}$ positions of arrays $T_{l}$, where $\mathcal{B}$ is the disk block size, that is the number of characters that are read or written in a single disk operation (see [43]). Notice that this step requires keeping $l\times\mathcal{B}$ characters in main memory: this is not a problem for Bioinformatics applications, since short reads are at most a few hundreds of characters long, and even longer reads are at most 20000 characters long. Anyway, it is possible to adapt the algorithm of [44, 45] to compute the arrays $T_{l}$ with $\mathcal{O}(\mathcal{B}^{2})$ main memory.

We will now analyze Algorithm 4. The time complexity of this procedure is $\mathcal{O}(mk\sigma)$ since such procedure is composed of a for loop that iterates over the encoding $I_{X^{p-1}}$ —whose length is $mk$ —performing constant time operations per element except for the loop at lines 4–4 that requires $\mathcal{O}(\sigma)$ time.

The I/O volume is $\mathcal{O}(mk\max\{\lg m,\lg l\})$ bits, since each iteration of the loop at lines 4–4 requires to read and write a constant number of elements of some lists whose values are bounded by $m$ or $l$ , and since $\alpha$ is kept in main memory. The main memory usage is $\mathcal{O}(\sigma\lg l+kw)$ bits, since we store $\sigma$ integers smaller than $l$ in $\alpha$ and $k$ pointers to the lists $B_{i}$ .

We can now analyze Algorithm 5, which is composed of two main steps: in the first one it prepares the input data structures (line 5), invokes Algorithm 2, and initializes some data structures. In the second part (lines 5–5) it computes the final encoding $I_{X^{p}}$ and the LCP array from the structures computed at the previous step by iteratively applying Algorithm 4.

The complexity of the first part is essentially that of Algorithm 2, since computing the lists $T_{0},\ldots,T_{k}$ (line 5) requires $\mathcal{O}(mk)$ time with a single scan of the input data (whose size is $mk$), while outputting the lists requires constant time per element.

The second step is mainly composed of a while loop that iteratively applies Algorithm 4 (that requires $\mathcal{O}(mk\sigma)$ ) to compute the final interleave and the final LCP array. Moreover, the proof of correctness of Algorithm 5 also shows that Algorithm 4 is applied $l+1$ times, where $l$ is the largest value in the LCP array.

Finally, Algorithm 5 builds the final BWT from $I_{X^{p}}$ and the lists $B_{0},\ldots,B_{k}$ by a single scan of those $m$-long lists, which requires $\mathcal{O}(mk)$ time overall. Therefore, Algorithm 5 requires an overall $\mathcal{O}(mkl\sigma)$ time.
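The final scan can be illustrated with a few lines of Python: each entry of the encoding says from which partial BWT the next symbol must be taken, and every list is read sequentially. The name `bwt_from_interleave` and the in-memory lists are assumptions of this sketch.

```python
def bwt_from_interleave(I, B):
    """Merge the partial BWTs B_0, ..., B_k following the encoding I."""
    pos = [0] * len(B)          # next unread position of each B_l
    out = []
    for l in I:
        out.append(B[l][pos[l]])
        pos[l] += 1
    return "".join(out)

# Strings AC and CA: suffixes $, $, A$, AC$, C$, CA$
bwt = bwt_from_interleave(
    [0, 0, 1, 2, 1, 2],
    [["C", "A"], ["C", "A"], ["$", "$"]],
)
```

Each symbol of each list is read exactly once, which is where the $\mathcal{O}(mk)$ bound of this step comes from.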

The I/O complexity of the first step is $\mathcal{O}(\max\{mk\lg m,mk\lg\sigma\})$ bits whereas the main memory requirement is $\mathcal{O}(\sigma w+kw+m\lg\sigma)$ bits. Indeed, computing the lists $T_{0},\ldots,T_{k}$ at line 5 requires us to store only one character at a time of each sequence $s_{i}$ and to append it to the correct list: therefore it has $\mathcal{O}(mk\lg\sigma)$ bits I/O volume and $\mathcal{O}(kw+\lg\sigma)$ bits main memory requirement. We have to include the requirements of Algorithm 2, which changes the main memory needed for the first step to $\mathcal{O}(\sigma w+kw+m\lg\sigma)$ bits.

The I/O volume of the second step is $\mathcal{O}(mkl\lg l)$ bits since it consists essentially of $l$ applications of Algorithm 4. Finally, while building the final BWT from $I_{X^{p}}$, Algorithm 5 reads $\mathcal{O}(mk\lg m)$ bits due to the interleave and $\mathcal{O}(mk\lg\sigma)$ bits due to the partial BWTs, writes $\mathcal{O}(mk\lg\sigma)$ bits for the final BWT, and requires $\mathcal{O}(\max\{\lg l,\lg\sigma\})$ bits of main memory, since at most it stores in main memory one element of $I_{X^{p}}$ and one element of a partial BWT.

Therefore, overall Algorithm 5 reads and writes $\mathcal{O}(mkl\max\{\lg m,\lg l\})$ bits from and to the disk and requires $\mathcal{O}(\sigma w+kw+m\lg\sigma+\lg l)$ bits of main memory. We can summarize our results as follows.

Proposition 2.

Given as input a set composed of $m$ strings of length $k$ over an alphabet of size $\sigma$, the procedure BWT+LCP computes the BWT and the LCP array of it in $\mathcal{O}(mkl\sigma)$ time, where $l$ is the maximal value of the LCP array. This procedure requires storing $\mathcal{O}(\sigma w+kw+m\lg\sigma+\lg l)$ bits in main memory, and it reads and writes from and to the disk $\mathcal{O}(mkl\max\{\lg m,\lg l\})$ bits.

Note that, if $\sigma$ is constant then the time complexity of the method presented in this paper becomes $\mathcal{O}(mkl)$ . Moreover, if the word size is $\max\{w,\lg m,\lg l\}$ then its I/O volume and main memory requirement become $\mathcal{O}(mkl)$ and $\mathcal{O}(k+m)$ respectively.
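The simplification can be spelled out term by term. As a sketch, assume $\sigma=\mathcal{O}(1)$ and a machine word of $w\geq\max\{\lg m,\lg l\}$ bits, and measure space and I/O volume in words rather than bits, that is, divide the bounds of Proposition 2 by $w$:

$$\frac{\sigma w+kw+m\lg\sigma+\lg l}{w}\leq\sigma+k+m\,\frac{\lg\sigma}{w}+1=\mathcal{O}(k+m),\qquad\frac{mkl\max\{\lg m,\lg l\}}{w}\leq mkl.$$

The first bound uses $\lg l\leq w$ and $\lg\sigma/w=\mathcal{O}(1)$; the second uses $\max\{\lg m,\lg l\}\leq w$.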

7 Results

We implemented the method proposed in this article in a prototype in C, named bwt-lcp-em (we will refer to it as ble in the following), that is freely available at https://github.com/AlgoLab/bwt-lcp-em. We compared our method with other tools specifically designed to index datasets composed of a huge number of short sequences, such as Next-Generation Sequencing read sets.

We have compared ble with the original implementation of extLCP [22] (BEETL), as well as a more recent version (BEETL2) that implements a fully external memory approach (the second version is available at https://github.com/giovannarosone/BCR_LCP_GSA), the in-memory method gsa-is [46], and two recent external memory tools (egsa [30] and egap [33]). Notice that gsa-is and egsa have been designed to compute the suffix array of a set of strings, so the computation of the BWT and of the LCP array is likely not optimized.

We used some non-default values for some of the parameters in order to minimize main memory usage: egap has been run with --lbytes 1, BEETL with --memory-limit=900, and egsa by compiling with MEMLIMIT=900. We allowed egap to use 95% of the available RAM, as suggested in its website.

We compared the considered tools in the scenario of 1GB of main memory available, by considering instances with 1, 2, 4, 8, 16, and 32 million sequences, taken from two different data sources: (a) 148bp Illumina reads from the Genome In A Bottle (GIAB) [47] consortium, more precisely from the NA24385 individual; (b) random sequences of length $151$ generated by a Python script that builds uniformly distributed fixed-length sequences over the DNA alphabet (for the extended and detailed experimentation results we refer the reader to the GitHub repository of the implementation). The goal of using these two datasets is to experimentally assess the theoretical time complexity of our approach, which shows a dependency on $l$, i.e., the maximum value of the LCP array (see Section 6). We expect that $l$ in the random datasets will be considerably less than the length $k$ of the input strings. On the other hand, the Illumina datasets represent the worst case of our approach as they have a 300$\times$ mean coverage. Hence they will surely contain duplicate reads and $l$ will be equal to the length $k$ of the input strings.
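For readers who want to reproduce a dataset of the second kind, a generator of uniformly distributed fixed-length DNA sequences can be written in a few lines. The following is a hypothetical sketch and not the script used for the experiments: the function name `random_reads`, the fixed seed, and the alphabet string are all assumptions.

```python
import random

def random_reads(n, length=151, alphabet="ACGT", seed=0):
    """Return n strings of the given length, each symbol drawn uniformly."""
    rng = random.Random(seed)      # fixed seed for reproducibility
    return ["".join(rng.choices(alphabet, k=length)) for _ in range(n)]

reads = random_reads(5)
```

With a uniform distribution over four symbols, long repeats between distinct reads are unlikely, which is why the maximum LCP value is expected to stay well below the read length on such inputs.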

We ran all the experiments on the same workstation running Ubuntu Linux 18.04 equipped with an Intel Core i7-4770 CPU running at 3.40GHz and a 256GB solid state disk. The machine is equipped with 8GB of RAM, and we limited the amount of RAM at boot time to 1GB to avoid the effects of OS caching.

Table 1 reports the time (in minutes) required to compute the BWT and the LCP array. The first column indicates the number of sequences in the dataset, whereas column $l$ indicates the maximum value of the LCP array on that dataset. Symbol $\star$ means that the tool could not complete the execution in our environment because it needed more than 1GB of RAM, while symbol $\diamond$ means that the tool could not complete the execution since it required more disk space than that available. Notice that BEETL and gsa-is did not complete some executions because of the RAM limit, while egsa required more disk space than that available. To better highlight the trends, Figure 8 visually depicts the same results presented in the table. We expect that this experiment will show the advantages of memory-conscious approaches.

The results point out that only ble, BEETL, BEETL2, and egap were able to deal with such a limited amount of main memory. Moreover, ble, BEETL2, and egap were able to compute the BWT and LCP array for all the datasets, while gsa-is and egsa could only cope with smaller datasets. Since gsa-is is an in-memory approach, its memory requirements made it impossible to process even moderately large instances.

On the NA24385 dataset, egap is the fastest tool on the instances up to 8M reads, while BEETL2 is the fastest on larger instances. Still, the trend in the running time hints that ble will likely become the fastest tool on instances larger than those considered here.

On the random dataset, ble is the fastest tool on all instances with at least 2 million reads—egap is the fastest on 1 million reads. egap and ble are always the two fastest tools.

The comparison of the running times on the two datasets empirically confirms that ble has a time complexity that depends linearly on the maximum value of the LCP array, while BEETL and BEETL2 depend only on the maximum length of the input strings. As expected, also egsa and egap show a dependency on the maximum value of the LCP array, as both are definitely faster on the random dataset than on the NA24385 dataset (for egap roughly $2\times$ , for egsa from $6\times$ to $15\times$ ).

Table 2 reports the RAM usage (in Megabytes) required to compute the BWT and the LCP array. Just as for Table 1, the first column indicates the number of sequences in the datasets, whereas column $l$ indicates the maximum value of the LCP array on that dataset. As before, symbols $\star$ and $\diamond$ mean that the tool could not complete the execution in our environment since it exhausted the RAM or the available disk space, respectively. Notice that ble is always the tool requiring the smallest amount of memory, by a factor of at least 6.

8 Conclusions

We have presented a new lightweight algorithm to compute the BWT and the LCP array of a set of $m$ strings, each $k$ characters long, based on applying a backward strategy for merging partial BWTs. More precisely, our algorithm has an $\mathcal{O}(mkl)$ time and I/O volume, and uses $\mathcal{O}(k+m)$ main memory to compute the BWT and LCP array, where $l$ is the maximum value in the LCP array. Our time complexity and I/O volume are, in the worst case, the same as those of the best previously available algorithms. The experimental analysis shows that our approach is competitive with the best available external-memory methods and that its advantage is noticeable on large inputs when the available RAM is limited.

The approach presented here may be further investigated in other research directions, for example in the case of arbitrary alphabets and for collections of strings in other contexts, such as dictionaries, where the parameter $l$ may be smaller than the size of the input strings. Theoretically, it is of interest to investigate the open question of whether the optimal $\mathcal{O}(mk)$ time can be achieved for computing the BWT. Some recent results (see, for example, [48]) may also suggest that running times of BWT-related algorithms could be improved on compressible inputs. Investigating whether these results have implications for our algorithm is an interesting research direction.

On the other hand, the prototype ble implementing our approach is still a proof of concept, although carefully developed, and future releases of the tool could improve its performance through, for example, a better buffering strategy for the input and output files, asynchronous I/O, or a better representation (i.e., fast compression) of the intermediate files.

Acknowledgements

We would like to thank Giovanni Manzini for the discussion on the implementation and on the comparison with the software egap. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 872539.

Bibliography

[1] J. Qin, et al., A human gut microbial gene catalogue established by metagenomic sequencing, Nature 464 (7285) (2010) 59–65. doi:10.1038/nature08821.
[2] M. Burrows, D. J. Wheeler, A block-sorting lossless data compression algorithm, Tech. rep., Digital Systems Research Center (1994).
[3] P. Ferragina, G. Manzini, Indexing compressed text, J. ACM 52 (4) (2005) 552–581. doi:10.1145/1082036.1082039.
[4] H. Li, Fast construction of FM-index for long sequence reads, Bioinformatics 30 (22) (2014) 3274–3275. doi:10.1093/bioinformatics/btu541.
[5] G. Rosone, M. Sciortino, The Burrows–Wheeler transform between data compression and combinatorics on words, in: CiE, Vol. 7921 of LNCS, 2013, pp. 353–364. doi:10.1007/978-3-642-39053-1_42.
[6] S. Mantaci, A. Restivo, G. Rosone, M. Sciortino, An extension of the Burrows–Wheeler Transform and applications to sequence comparison and data compression, in: CPM, Vol. 3537 of LNCS, 2005, pp. 178–189. doi:10.1007/11496656_16.
[7] S. Mantaci, A. Restivo, G. Rosone, M. Sciortino, An extension of the Burrows–Wheeler Transform, Theoretical Computer Science 387 (3) (2007) 298–312. doi:10.1016/j.tcs.2007.07.014.
[8] B. Langmead, C. Trapnell, M. Pop, S. Salzberg, Ultrafast and memory-efficient alignment of short DNA sequences to the human genome, Genome Biology 10 (3) (2009) R25. doi:10.1186/gb-2009-10-3-r25.
