TL;DR
This paper introduces an external memory algorithm for efficiently computing the BWT and LCP array of large string collections, crucial for genome assembly, using a novel backward approach that reduces memory and I/O requirements.
Contribution
It presents a new external memory algorithm employing a backward strategy to compute BWT and LCP arrays simultaneously for large string sets, improving efficiency over previous in-memory methods.
Findings
Algorithm runs in O(mkl) time and I/O volume for constant-length strings.
Uses only O(k + m) main memory, suitable for large datasets.
Effective for genome assembly and large-scale string indexing.
Abstract
Indexing very large collections of strings, such as those produced by the widespread next generation sequencing technologies, heavily relies on multistring generalization of the Burrows-Wheeler Transform (BWT): large requirements of in-memory approaches have stimulated recent developments on external memory algorithms. The related problem of computing the Longest Common Prefix (LCP) array of a set of strings is instrumental to compute the suffix-prefix overlaps among strings, which is an essential step for many genome assembly algorithms. In a previous paper, we presented an in-memory divide-and-conquer method for building the BWT and LCP where we merge partial BWTs with a forward approach to sort suffixes. In this paper, we propose an alternative backward strategy to develop an external memory method to simultaneously build the BWT and the LCP array on a collection of m strings ofā¦
| No.Ā of strings | Dataset: NA24385 | ||||||
|---|---|---|---|---|---|---|---|
| ble | BEETL2 | BEETL | gsa-is | egsa | egap | ||
| 1M | 148 | 26 | 12 | 11 | 28 | 60 | 1 |
| 2M | 148 | 53 | 25 | 23 | 239 | 18 | |
| 4M | 148 | 105 | 51 | 52 | 693 | 44 | |
| 8M | 148 | 213 | 122 | 124 | 2370 | 184 | |
| 16M | 148 | 414 | 241 | 227 | 448 | ||
| 32M | 148 | 855 | 633 | 915 | |||
| No.Ā of strings | Dataset: NA24385 | ||||||
|---|---|---|---|---|---|---|---|
| ble | BEETL2 | BEETL | gsa-is | egsa | egap | ||
| 1M | 148 | 6 | 35 | 255 | 747 | 764 | 728 |
| 2M | 148 | 10 | 67 | 453 | 770 | 760 | |
| 4M | 148 | 18 | 131 | 722 | 775 | 774 | |
| 8M | 148 | 34 | 255 | 710 | 773 | 772 | |
| 16M | 148 | 65 | 483 | 732 | 768 | ||
| 32M | 148 | 127 | 781 | 763 | |||
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
\publishers
DISCo, UniversitĆ degli Studi di MilanoāBicocca, Milan, Italy
āCorresponding author [email protected]
Computing the multi-string BWT and LCP array in external memory
Paola Bonizzoniā
āā
Gianluca Della Vedova
āā
Yuri Pirola
āā
Marco Previtali
āā
Raffaella Rizzi
Abstract
Indexing very large collections of strings, such as those produced by the widespread next generation sequencing technologies, heavily relies on multi-string generalization of the Burrows-Wheeler Transform (BWT): large requirements of in-memory approaches have stimulated recent developments on external memory algorithms. The related problem of computing the Longest Common Prefix (LCP) array of a set of strings is instrumental to compute the suffix-prefix overlaps among strings, which is an essential step for many genome assembly algorithms. In a previous paper, we presented an in-memory divide-and-conquer method for building the BWT and LCP where we merge partial BWTs with a forward approach to sort suffixes.
In this paper, we propose an alternative backward strategy to develop an external memory method to simultaneously build the BWT and the LCP array on a collection of strings of different lengths. The algorithm over a set of strings having constant length has time and I/O volume, using main memory, where is the maximum value in the LCP array.
1 Introduction
In this paper we address the problem of constructing, simultaneously and in external memory, the Burrows-Wheeler Transform (BWT) and the Longest Common Prefix (LCP) array for a large collection of strings. The widespread use of Next-Generation Sequencing (NGS) technologies, that are producing everyday several terabytes of data that has to be analyzed, requires efficient strategies to index very large collections of strings. For example, common applications in metagenomics require indexing of collections of strings (reads) that are sampled from several genomes: those strings can easily contain more than characters. In fact, to start a catalogue of the human gut microbiome, more than 500GB of data have been usedĀ [1].
The Burrows-Wheeler Transform (BWT)Ā [2] is a reversible transformation of a text that was originally designed for text compression; it is used for example in the bzip2 program. The BWT of a text is a permutation of its symbols and is strictly related to the Suffix Array of . In fact, the i- symbol of the BWT is the symbol preceding the i- smallest suffix of according to the lexicographical sorting of the suffixes of . The BWT has gained importance beyond its initial purpose, and has become the basis for self-indexing structures such as the FM-indexĀ [3], which allows to efficiently perform important tasks such as searching a pattern in a textĀ [3, 4, 5]. The generalization of the BWT (and the FM-index) to a collection of strings was introduced inĀ [6, 7].
An entire generation of recent Bioinformatics tools heavily rely on the notion of BWT. Representing the reference genome with its FM-index is the basis of the most widely used aligners, such as BowtieĀ [8], BWAĀ [9, 10] and SOAP2Ā [11]. Still, to attack some other fundamental Bioinformatics problems, such as genome assembly, an all-against-all comparison among the input strings is needed, especially to find all prefix-suffix matches (or overlaps) between reads in the context of the Overlap-Layout-Consensus (OLC) approach based on string graphĀ [12]. This fact justifies the search for extremely efficient algorithms to compute the BWT on a collection of stringsĀ [13, 14, 15, 16]. For example, SGA (String Graph Assembler)Ā [17] is a de novo genome assembler that builds a string graph from the FM-index of the collection of input reads. In a preliminary version of SGAĀ [18], the authors estimated, for human sequencing data at a 20x coverage, the need of 700Gbytes of RAM in order to build the suffix array, using the construction algorithm inĀ [19], and the FM-index. Another technical device that is used to tackle the genome assembly in the OLC approach is the Longest Common Prefix (LCP) array of a collection of strings, which is instrumental to compute the prefix-suffix matches in the collection.
The construction of the BWT and LCP array of a huge collection of strings is a challenging task. A simple approach is constructing the BWT from the Suffix Array, but it is prohibitive for massive datasets mostly due to the main memory requirements. A first attempt to solve this problemĀ [20] partitions the input collection into batches, computes the BWT for each batch and then merges the results.
The huge amount of available biological data has stimulated the development of efficient external memory algorithms (called, BCR and BCRext) to construct the BWT of a collection of stringsĀ [21]. Similarly, a lightweight approach to the construction of the LCP array (called extLCP) was investigated inĀ [22]. With the ultimate goal of obtaining an external memory genome assembler, LSGĀ [23, 24] is based on BCRext and contains an external memory approach to compute the string graph of a set of reads. In that approach, external memory algorithms to compute the BWT and the LCP arrayĀ [16, 25] are fundamental.
In this context, we are considering a model of computation where memory is split into two parts: a finite random access memory, and an unlimited sequential access disk. Essentially, this model is an extension of the standard RAM model where we also have a sequential access disk.
In this paper we present a new lightweight (external memory) approach to compute the BWT and the LCP array of a collection of strings of different lengths, which is alternative to extLCPĀ [22] and other approachesĀ [26, 27, 28, 29, 30]. The literature is rich of in-memory methodsĀ [31, 32], as well as some external-memory algorithm on a single textĀ [31]. From a theoretical point of view, we can transform a set of strings into an instance consisting of a single text by concatenating the input strings after adding a distinct sentinel for each string. Anyway this would increase the alphabet size from to , and this effect must be taken into account.
The algorithm BCRext is proposed together with BCR and both are designed to work on huge collections of strings (the experimental analysis is on hundreds of millions of 100-long strings). Especially extLCP is lightweight because, on a collection of strings of length , it uses only RAM space and essentially CPU time, with matching I/O volume, under the usual assumption that the word size is sufficiently large to store all addresses.
An important question is how to achieve the optimal I/O volume. BCRextĀ [21] incrementally computes the BWT of the collection via iterations. At each iteration () the algorithm computes a partial BWT that is the BWT for the ordered collection of suffixes with length at most . This approach requires that, at each iteration , the symbols preceding the suffixes of with length must be inserted at their correct positions into , that is each iteration simulates the insertion of the suffixes with length in the ordered collection of the suffixes with length at most . Updating the partial BWT in external memory, the process requires a sequential scan of the file containing the information of the partial . Thus the I/O volume at each iteration is at least (since there are suffixes for each length between to ). Consequently the total I/O volume for computing is at least . More precisely, the BCRext algorithm in [21] that uses less RAM, requires at each iteration an additional I/O volume given by , due to a process of ordering special arrays used to save RAM space. Our algorithm for building the BWT and the LCP, differently fromĀ [21], consists of two distinct phases: the first phase that has I/O volume and time complexity produces arrays , each array lists the symbols preceding the suffixes with length exactly according to the lexicographical ordering of such suffixes. The second phase computes the interleave of vectors that is equal to the BWT of . Indeed, the BWT is an interleave of the arrays , since the ordering of symbols in is preserved in the final BWT , i.e., is stable w.r.t.Ā each array . Inspired byĀ [27], we perform this step by a number of iterations, where the length of the longest substring that has at least two occurrences in . Thus the merging operation takes fewer iterations than BCRext (the latter requires iterations). Observe that at each iteration of the merging procedure of the arrays , a partial LCP array is computed to get the final LCP array at the last iteration.
Our algorithm has time complexity, uses I/O volume, and main memory, where is the maximum value in the LCP array and is the space required to store a memory address. Moreover, our approach is entirely based on linear scans (i.e., it does not contain a sorting step) which makes it more amenable to actual disk-based implementations. We point out that , therefore our time and I/O complexities are at least as good as those of extLCPĀ [22] when building the data structures for massive sets of short sequences over a constant alphabet and if and are smaller than the word size (which is usually the case). The RAM usage of our approach and that of extLCP are not easily comparable, since they are respectively and . If we suppose that we can store a memory address in a memory word, our RAM usage is . This means that, in theory, when building the BWT and the LCP of few large strings extLCP will use less RAM than the method presented in this paper. We point out that our algorithm works also on a set of reads having different lengths, and the following sections describe the algorithm referring to that case.
While writing our paper, two similar approaches have appeared in the literatureĀ [28, 30]. The method proposed inĀ [28] starts from the BWT merging phase ofĀ [27] to also build the LCP array using a small amount of memory. We point out that our paper andĀ [28] are two independent works. Moreover, our focus is on an external-memory approach, which is not explicitly pursued inĀ [28]. An extension to fully external memory computation of BWT and LCP ofĀ [28] is egapĀ [33]. This method computes the data structures in three separate steps, (1) splitting the input sequences in subcollections such that the BWT can be computed in-memory, and (2-3) then merging them building the LCP along the way. The I/O and time complexities of egap are both , matching the ones of our algorithm when the alphabet is constant.
The method proposed inĀ [30] (egsa) is a two-phase algorithm for the construction of the Generalized Enhanced Suffix Array, the LCP, and, optionally, the BWT. In the first step the required data structures are build for each sequence in input, whereas in the second step the output is produced by merging the data structures built previously. Although egap can be seen as an evolution of egsa, we included both tools in our experimental evaluation to highlight that building the Generalized Enhanced Suffix Array requires way more resources than directly building the BWT and the LCP.
The paper is laid out as follows. In SectionĀ 2 we provide the basic definitions we will use in the following. In SectionĀ 3 we give a high level description of the method proposed in this paper and illustrate the backward and forward strategies to merge partial arrays. In SectionĀ 4 and SectionĀ 5 we dive into the details of our algorithm. In SectionĀ 6 we analyze the time and I/O complexities of our method. In SectionĀ 7 we provide an experimental analysis of our tool and a comparison with other tools available in the literature. Finally, in SectionĀ 8 we recap the contributions of this paper.
2 Preliminaries
Let be a finite alphabet where c_{0}=\$$ (called *sentinel*), and c_{0}<c_{1}<\cdots<c_{\sigma}<\SigmaS={s_{1},s_{2},\ldots,s_{m}}ms_{j}k_{j}\Sigma\setminus{$}, such that is the total length of . The set is intended as a sequence of strings, where is the j- string in the set. The -th symbol of string is denoted by , and the substring of is denoted by . We will refer to as the last character of the string and is the character immediately before the sentinel. The suffix and prefix of with length are the substrings (denoted by ) and (denoted by ) respectively. Observe that counts, for the suffix, only the characters which are in \Sigma\setminus\{\}). Then, the suffix and prefix with length of a string will be called the -suffix and -prefix of , respectively. In particular, the [math]-suffix is the suffix uniquely composed of the sentinel K).
Given the lexicographic ordering of the suffixes of , the Suffix Array is the -long array where the element is equal to if and only if the -th element of is the -suffix of string . We make the assumption that a suffix s\$$ from string s_{i}s$s_{j}i<jSS$.
The definition of suffix array we provide is slightly different than the one conventionally used, where refers to the suffix of the string starting at position . We have decided to abide by this definition to ease the presentation of the method in the following sections.
The Burrows-Wheeler Transform (BWT) of is the -long array where if , then is the first symbol of the -suffix of if , otherwise B[i]=\$$. In other words BX when the suffix is the complete string (i.e., the -suffix).
The Longest Common Prefix (LCP) array of is the -long array such that is the length of the longest prefix shared by suffixes and . Conventionally, .
Now, we can give the definition of Interleave of a generic set of arrays, that will be used extensively in the following.
Definition 1**.**
Given arrays , then an array is an interleave of if is the result of merging the arrays such that: (i) there is a 1-to-1 function from the set to the set , (ii) for each , and (iii) for each .
The interleave is an array giving a fusion of which preserves the relative order of the elements in each one of the arrays. As a consequence, for each with , the j- element of corresponds to the j- occurrence in of an element of . This fact allows to encode the function as an array such that if and only if is an element of . By observing that is equal to where is the number of integers equal to in , it is easy to show how to reconstruct from (see AlgorithmĀ 1 where the array keeps, for each index from [math] to , such number while scanning array ).
In the following, we will refer to array as interleave-encoding (or simply encoding). FigureĀ 1 shows an example of an interleave of four arrays and its encoding.
3 The algorithm
In this section we will provide a sketch of our algorithm. Let be the maximum length of a string in (excluding X_{l}B_{l}0\leq l\leq kmX_{l}[i]{th}llSB_{l}[i]X_{l}[i]X_{0}B_{0}SB_{l}BSBk+1B_{0},B_{1},\ldots,B_{k}B_{l}0\leq l\leq kB$.
Similarly, the lexicographic ordering of all suffixes of is an interleave of the arrays . Let be the encoding of the interleave of arrays giving the BWT , and let be the encoding of the interleave of arrays giving . Then it is possible to show that . Now, given it is immediate to reconstruct by using AlgorithmĀ 1.
In the following, we will call and as partial BWTs and partial Suffix Arrays, respectively. FigureĀ 2 shows an example of partial BWTs and partial Suffix Arrays for a set of reads on alphabet , whose interleaves and (BWT and sorted suffixes, respectively) and the encoding are reported in the first, second and third columns of FigureĀ 3.
Our algorithm for building the BWT and the LCP array consists of two distinct phases: in the first phase it computes each partial BWT () by implicitly sorting the -suffixes of (see SectionĀ 4), while in the second phase it determines (see SectionĀ 5) by a merging algorithm inspired byĀ [27] (for merging two BWTs), thus allowing to reconstruct as an interleave of . We slightly modified the approach inĀ [27] in order to merge the arrays into the BWT by implicitly merging the array into the array (giving the lexicographic ordering of all suffixes of ). The second phase computes, together with the BWT , also the LCP array of the input set .
We note that the definition of partial BWTs and the method sketched here hint to some relation between the partial BWTs and the positional BWT (pBWT) presented inĀ [34], although the latter is presented for an alphabet of size . Indeed, given a set of sequences, both reorder the characters at a given distance from one end of each sequence in input. More precisely, each partial BWT is an ordering of all the elements of the sequences at a given distance from the end of them, whereas each column of the pBWT is an ordering of all the elements at a given distance from the start of the sequences. In light of this fact, we can describe the two steps sketched in this section as follows: (i) build the pBWT of the input sequences reversed, and (ii) build the BWT and the LCP array by merging the columns of the pBWT. Although we will not describe our method in terms of pBWT in the following sections, we think that the connection we just highlighted further confirms strong relations between multiple BWT-like data structures presented thorough the years to index different structures (e.g., treesĀ [35], de Bruijn graphsĀ [36, 37, 38], and circular patternsĀ [39]), as recently shown inĀ [40].
Both phases of our method apply a Radix Sort strategy to reorder the suffixes (i.e., the -suffixes of in order to compute the partial BWT in the first phase, and the overall set of suffixes of in the second phase in order to compute ). The first phase iteratively computes the partial BWTs . Each iteration () computes from the order of the -suffixes (array ) implicitly computed by the previous iteration (array ). We point out that this algorithm adopts a LSD Radix Sort strategy that can be interpreted as āglobalā, since suffixes are sorted from the rightmost to the leftmost character (that is, it adopts a LSD strategy), and the order of is implicitly obtained from the order of without applying the radix sort to each one of the sets of -suffixes.
The second phase applies a MSD Radix Sort strategy since it reorders the suffixes from the leftmost to the rightmost characters, and can be performed in two different ways as described in the following section.
3.1 Backward and forward strategies for merging the partial BWTs
The encoding is basically computed by an iterative procedure starting by the trivial sorting given by taking first the suffixes of , followed by the suffixes of , followed by the suffixes of , etc., followed by the suffixes of (trivial interleave). Note that the encoding of the trivial interleave is given by runs of the integers from [math] to : that is, integers equal to [math], followed by integers equal to , etc., followed by integers equal to . Starting from that sorting, the procedure applies a MSD Radix Sort strategy to sort the suffixes of , by the first (leftmost) characters at the first iteration, then by the first two characters at the second iteration, etc., and finally by the first characters ( is the maximum length of the strings in the input set ) at the k- iteration. More precisely, at the p- iteration, it computes the encoding of the interleave, giving the sorting by the first characters, from the interleave giving the sorting by the first characters (computed at the previous iteration). At the k- iteration the computed encoding is clearly .
In the following, the interleave of arrays , giving the sorting of the suffixes by the first characters will be called -interleave and denoted as , and its encoding will be denoted as . The encoding is clearly equal to (and is equal to ). We point out that is the list of all the suffixes in the input collection sorted by their prefixes of length . In other words, includes also suffixes shorter than . In this ordering, a suffix s\$$ shorter than psX^{q}q>p$.
Iteration computes the encoding from the encoding obtained at the iteration . The first iteration computes from which is the trivial encoding composed of runs of the integers from [math] to ( is the encoding of the [math]-interleave giving trivially the suffixes sorted by the first [math] characters).
Two different strategies can be used for computing from , which are based on the two following observations.
Observation 1**.**
If with length is the -th suffix preceded by a symbol , then the suffix with length will be the -th suffix in starting with . Therefore, will be equal to , such that , and is the number of symbols preceding suffixes of which are smaller than . Observe that, when c=\$$, then cX^{p-1}[i]s$ is equal to [math].
Observation 2**.**
Let be the interval of positions related to the suffixes of sharing the first characters, and (among them) let us consider the -th suffix having a given at position . Then, such suffix will be at position of (), where is the number of suffixes in the interval having a symbol smaller than at position . Therefore, will be equal to .
The first strategy (backward) is based on ObservationĀ 1 and consists in scanning the encoding . empty buckets are initialized (one for each alphabet symbol), and for each length , the symbol preceding the related suffix is obtained from the partial BWTs ( is indeed the -th symbol of the interleave of the partial BWTs encoded by , and will be the -th of vector if suffix is the -th suffix in having length ). At this point, the length is added to the bucket related to symbol , if is not the sentinel bucket). At the end of the iterations, the concatenation of the buckets following the lexicographical order of the symbols, provides the encoding .
The second strategy (forward) is based on ObservationĀ 2 and maintains a partitioning of the generic encoding into contiguous segments, which are called -segments. A -segment is an interval of positions which are related to suffixes sharing the first characters. The forward strategy consists in scanning the -segments of the encoding one after the other and uses initially empty buckets. For each -segment , its suffixes are considered, and each suffix length is added to a bucket depending on the symbol at position in the suffix. At the end of the iterations over the -segment, the concatenation of the buckets following the lexicographical order of the symbols, provides the encoding between positions and .
Both algorithms compute the interleave giving the Suffix Array as defined in Section 2 whose uniqueness is guaranteed by the radix sort strategy.
FigureĀ 4 shows an example for the two strategies applied to a set of three reads.
In SectionĀ 5 the backward strategy will be detailed alongside the computation of the LCP array. We refer toĀ [41, 42] for the details about the forward strategy.
4 Computing the partial BWTs
The first phase of the method computes the partial BWTs by first preprocessing the input strings in order to obtain arrays with length , where lists the characters such that when , T_{l}[i]=\$$ when l=|s_{i}|T_{l}[i]=#l>|s_{i}|#is a padding symbol not belonging to the alphabet of the input strings). Section (a) of FigureĀ [5](#S4.F5) reports an example of arraysT_{l}for the three strings of FigureĀ [2](#S3.F2). Observe thatT_{0}\langle s_{1}[k_{1}],s_{2}[k_{2}],\ldots,s_{m}[k_{m}]\rangleST_{0}B_{0}T_{l}[i]#ls_{i}$.
The preprocessing step is a trivial task that iterates over the input strings and outputs the arrays . We can summarize this procedure as a loop that iterates over the input strings and performs the following steps. Let be the input string and suppose that we already preprocessed the previous sequences. We first reverse and produce the string . Then, for each position of , we write the -th character of at position of array , padding this array with # if it includes less than elements. Finally, we write \$$ at position iT_{|s_{i}|}$ (padding it with # if required) and move to the next sequence.
The partial BWTs are computed by AlgorithmĀ 2 which receives in input the arrays , and performs iterations. Iteration () computes from which is implicitly known and implicitly determines to be used in the next iteration. More in detail, the ordering of array is known by means of a array , with length at most , such that if and only if the i- element of is the -suffix from the input string . In other words, position of array gives the index of the string whose -suffix is the i- in . The partial BWT can be computed (see cycle at lineĀ 2) by scanning , since is (by definition) the symbol preceding the -suffix of the string , where the index is equal to , and can be obtained by accessing array . Indeed, . Observe that is treated by AlgorithmĀ 2 as a list initially empty, and the symbol is appended to only if it is not the padding symbol # (signaling that the originating string is shorter than ). Note that, at the first iteration , (which is set in cycle at lineĀ 2) is the sequence of indices , and is correctly computed as the sequence of the last characters (i.e., ).
At the same time, the iteration computes the array to be used in the next iteration in order to compute the partial BWT . Observe in fact that the i- -suffix of is preceded by the symbol , where , and belongs to string . Assuming that the i- suffix of is the h- suffix of which is preceded by that symbol , then the -suffix of is the h- suffix of starting with . Furthermore, let us assume that there are -suffixes of starting with a symbol smaller . Then the -suffix of is the (r+h)- suffix of . By definition, it holds . The algorithm uses lists , a list for each symbol in , which are created at the beginning of iteration . During the scanning of the index is added to the list .
It is easy to prove that, at the end of iteration , the concatenation of lists (according to the order of the symbols in ) correctly gives . Note that is computed also by the last iteration , even though it is actually not used. FigureĀ 5 exemplifies the iteration of AlgorithmĀ 2 which computes, for the set of reads of FigureĀ 2, the partial BWT (see cycle for at lineĀ 2) and the array (lineĀ 2), from the array (computed by the previous iteration ). Array will be used by the next iteration for computing the partial BWT .
Observe that arrays must be kept in main memory, since they are not accessed sequentially (see Algorithm 5), and for this reason they cannot be stored in external memory.
5 Backward strategy for computing the encoding and the LCP array
This section is devoted to describe the second step of our algorithm which computes the BWT and the LCP array according to the backward strategy described in SectionĀ 3.
First of all, we describe in detail how the single iteration works (see AlgorithmĀ 3). Then, we show how to enrich AlgorithmĀ 3 in order to compute also the LCP array together with the encoding (see AlgorithmĀ 4). Finally, the complete procedure for computing , from the partial BWTs , is presented (see AlgorithmĀ 5) and is explained how to use the LCP array values in order to limit the iterations to the number strictly necessary to obtain .
At this point, let us assume to have (iteration ) the encoding of the -interleave . We want to compute the encoding of the -interleave , by sorting the suffixes of by the first characters. The algorithm implicitly obtains (suffixes sorted by the first characters) by implicitly reordering the characters preceding each one of the suffixes of (suffixes sorted by the first characters). We note that (by definition) for any from [math] to the first entries of are all equal to [math]. Indeed, the [math]-suffixes (of the set ) occupy always the first positions for any value of .
Before entering the details of iteration (see AlgorithmĀ 3), we give the idea of the algorithm. Let us consider the suffix whose length is . Let be the symbol preceding such suffix. Let be the subset of suffixes of preceded by a symbol smaller than , and let be the subset of suffixes at a position of preceded by the symbol . It is easy to show that the suffix (with length ) is greater (by the first characters) than all and only those suffixes , such that and is the symbol preceding . Therefore, the suffix (with length ) will occupy in the position , and will be equal to . In other words, position of suffix on is given by the sum where is the number of suffixes of preceded by a symbol smaller than and is the number of suffixes, which are before and preceded by symbol .
AlgorithmĀ 3 creates a set of lists containing at the end of iteration the partitioning of the encoding of by the first character of the suffixes of . Since the list (we recall that c_{0}=\$$) is related to the [math]-suffixes, then it is fixed over the iterations pmI_{X^{p}}(see lineĀ [3](#algorithm3)) as the concatenation\mathcal{I}(c_{0})\mathcal{I}(c_{1})\cdots\mathcal{I}(c_{\sigma})\mathcal{I}(\cdot), AlgorithmĀ [3](#algorithm3) performs a scan of I_{X^{p-1}}ql=I_{X^{p-1}}[q]qthX^{p-1}cpreceding such suffix (see lineĀ [3](#algorithm3)). VectorposcB_{l}c\neq$lX^{p-1}[q]l+1\mathcal{I}(c)c=c_{0}=$q+1lX^{p-1}[q]cX^{p-1}[q]\mathcal{I}(c_{0})$, which is fixed (by definition) over the iterations.
This approach is alternative to the one presented inĀ [26] first and then implemented in [41]. In fact, the iteration is a backward extension of the suffixes sorted by the first characters in order to obtain the suffixes sorted by the first characters. Instead the strategy presented inĀ [26] is based on a forward extension of the -prefixes of the suffixes in order to obtain the ordering given by the encoding .
The following theorem proves the correctness of AlgorithmĀ 3.
Theorem 1**.**
If AlgorithmĀ 3 receives in input the encoding of the -interleave , then it computes the encoding of the -interleave .
Proof.
Observe that the -prefix (prefix with length ) of the i- suffix of is the suffix of the -prefix of a suffix of , starting with the symbol . Then, line 7 of AlgorithmĀ 3 appends length to the list . Observe that lineĀ 3 implicitly computes a partitioning of the suffixes in , according to their starting symbol, into lists , where gives the ordering, by the first characters, of the suffixes starting with symbol . Each list contains (at lineĀ 3) the lengths of such suffixes.
Furthermore, given two distinct suffixes and such that is smaller (by the first characters) than , either they begin with two different symbols , or they both start with the same symbol, i.e., . Let and be the partitions of containing the suffixes starting with and (respectively). Then, in all suffixes in precede those in . Inside the list , the ordering of two suffixes and by the first characters is the same as in . Indeed, is lexicographically smaller than if and only if is lexicographically smaller than . It follows that consists of the concatenation of according to the lexicographic ordering of symbols of alphabet , and thus lineĀ 3 of AlgorithmĀ 3 computes the encoding of . ā
In the following we will describe how to compute the LCP array of the input dataset. Similarly to the computation of the BWT , the LCP array will be constructed iteratively. More precisely, the LCP array will be constructed by considering prefixes of the suffixes by increasing length. At this point, we can describe how to update AlgorithmĀ 3 (iteration ) in order to compute (at the end of the iterations) also the LCP array.
To this aim we must introduce the following definition.
Definition 2**.**
Given the LCP array, is defined such that .
Observe that is the length of the longest prefix shared by the -prefix of and the -prefix of . We note that, when a suffix in is shorter than , then its -prefix (considered for ) is the whole suffix itself ($ excluded).
The array is equal to the LCP array of the input set , and contains all [math]s, except for that is equal to . In FigureĀ 7 and are reported for the input set of FigureĀ 2.
The LCP array is computed iteratively by starting from . Now we describe the single iteration for computing from . AlgorithmĀ 4 extends AlgorithmĀ 3 in order to compute and from and .
AlgorithmĀ 4 builds a set of lists containing the partitioning of the elements of by the first character () of the related suffix. Since the list \mathcal{L}(c_{0}=\)-1m-1\mathcal{L}(c_{i})1\leq i\leq\sigma) is always [math]. Finally, AlgorithmĀ [4](#algorithm4) concatenates all the lists \mathcal{L}(c_{0}),\mathcal{L}(c_{1}),\ldots,\mathcal{L}(c_{\sigma})\mathit{LCP}_{p}$ (see lineĀ 4).
Before giving the detail of computing the single lists , we need to introduce the following function. Given a position and a symbol c\neq\$$, the function \alpha_{p}(q,c)pX^{p}[q]X^{p}[h]hqX^{p}[h]ch\alpha_{p}(q,c)=-1$.
In the following, given two strings , we denote (respectively) by and the length of the longest common prefix between the -prefixes of and , and the length of the longest common prefix between and (that is, ). The following proposition relates the values of and and it is a direct consequence of their definitions.
Proposition 1**.**
Let and be two consecutive suffixes of , and let be the q- suffix of . Then .
During the scan of the encoding , the value is obtained (see line 13 of Algorithm 4). The function is maintained in the array of size initially set to values s, and updated in the cycle at lineĀ 4. The main invariant of AlgorithmĀ 4 is that, at lineĀ 4, the variable is equal to āthis invariant is a consequence of the following LemmaĀ 1 and can be proved by a direct inspection of AlgorithmĀ 4. The value incremented by is appended to the list .
Lemma 1**.**
Let and be respectively the j- and the q- suffixes of , such that , and let be the symbol preceding suffix . If every suffix at a position between and (), is not preceded by the symbol , then it holds that .
Proof.
Since is not the symbol that precedes the suffix at position Ā with , then by definition of , it must be that , since the is the largest integer less than for which the j- suffix is preceded by symbol . Since it is immediate to verify that , the lemma easily follows. ā
The previous argument allows us to prove the following theorem which, combined with TheoremĀ 1 completes the correctness of AlgorithmĀ 4.
Theorem 2**.**
Given as input and the partial BWTs , AlgorithmĀ 4 computes .
Proof.
Observe that at lineĀ 4 iff the current suffix at position is not the first to be preceded by the character , hence we must append the value to . Since , the theorem is proved. ā
In FigureĀ 6 the computation of from (by AlgorithmĀ 3 for ) is shown for the set of reads presented in FigureĀ 2. The encodings and are reported in FigureĀ 7 together with and whose computation (by AlgorithmĀ 4) has been omitted for simplicity.
The procedure BWT+LCP (see AlgorithmĀ 5) computes and , which are the encoding of the BWT and the LCP array of the input set of strings, by iterating AlgorithmĀ 4. Iterations stop when the maximum value in the array is less than . In fact, it means that for an iteration , the values and do not change since the suffixes have been fully sorted and thus and remain equal to and , respectively. The correctness of the procedure BWT+LCP is a consequence of TheoremĀ 2 and DefinitionĀ 2. Observe that if the maximum value in the LCP array is equal to , then at each iteration of AlgorithmĀ 5 with , the maximum value in is , in virtue of TheoremĀ 2 and DefinitionĀ 2. When , then by DefinitionĀ 2, the iteration gives value , that is . Then the suffixes have been fully sorted and the LCP array has been computed at the previous step .
Observe that, in virtue of the radix sort strategy, the two steps of our method (computing the partial BWTs and computing the interleave ) do not depend on the particular order of the strings in the input set . For this reason, there is no particular order of the input strings which may improve the computation.
5.1 Comparison with other strategies
While a common element of our method with egap and BEETL is the use of a radix sort strategy, a main difference is represented by the collection of objects to which it is applied. BEETLās algorithm works by a unique step and is based on the following invariant: at the iteration , it computes the partial BWT for the collection of suffixes of length at most . Differently, our algorithm works by two steps: first it computes the partial BWTs (as previously defined) and then the interleave . In the second step the following invariant is maintained: at the iteration , it computes the list of the symbols preceding all the suffixes in the input collection sorted by the -long prefixes. As a consequence, at the iteration , it computes a permutation of the BWT for tending to the solution over the iterations, while BEETL computes a subsequence of the BWT for and maintains over the iterations the reciprocal order between the symbols.
Arrays used by the first step of our algorithm (computing the partial BWTs ) are the same as arrays used in [16]. Indeed, in our case is the position in of the string which is the origin of the -th -suffix whose preceding symbol is . Arrays in [16] are defined such that is the position in , of the string which is the origin of the -th -suffix (in the partial BWT) starting with the -th symbol of the alphabet. We note that the concatenation gives the array of our algorithm.
Observe that both our algorithm and egap use the notion of an interleave in order to compute the BWT and the LCP array. More precisely, egap splits the input collection into subcollections sufficiently small, then it computes the BWT (partial BWTs) for each subcollection and finally it merges the BWTs similarly to the approach inĀ [27]. On the other hand, our algorithm first computes the partial BWTs from the whole collection , that are then merged maintaining the invariant property described above.
6 Complexity
In this section we will analyze the computational and I/O volume of our algorithm.
First we will analyze AlgorithmĀ 2. This procedure mainly consists of two nested loops in which each operation requires constant time. If the input is a set of strings of length , the time complexity of it is . Note that each of the lists and have elements which are read or written sequentially and, moreover, each list is read only once per execution. Hence, the I/O volume of AlgorithmĀ 2 is since, for each element in , AlgorithmĀ 2 appends an integer less than to the correct list that we can store on disk, since we access them sequentially.
Besides some -space data structures, the algorithm uses lists to store pointers to the open files and arrays to store the characters of the sequences. Note that, at each iteration of the loop at lineĀ 2, only one array must be kept in main memory, since we need to perform non-sequential accesses, and requires bitsānotice that for one million DNA reads, that translates to 256 Mbytes of memory, which is well below the RAM amount found in standard PCs. Therefore, if we can address each file using bits, the main memory requirement of AlgorithmĀ 2 is bits.
Furthermore, arrays () can be computed in time and I/O volume by reading sequentially input strings at a time and producing positions of arrays , where is the disk block size, that is the number of characters that are read or written in a single disk operation (see [43]). Notice that this step requires to keep characters in main memory: this is not a problem for Bioinformatics applications, since short reads are at most a few hundreds of characters long, and even longer reads are at most 20000 characters long. Anyway, it is possible to adapt the algorithm ofĀ [44, 45] to compute the arrays arrays with main memory.
We will now analyze AlgorithmĀ 4. The time complexity of this procedure is since such procedure is composed of a for loop that iterates over the encoding āwhose length is āperforming constant time operations per element except for the loop at linesĀ 4ā4 that requires time.
The I/O volume is bits, since each iteration of the loop at linesĀ 4ā4 requires to read and write a constant number of elements of some lists whose values are bounded by or , and since is kept in main memory. The main memory usage is bits, since we store integers smaller than in and pointers to the lists .
We can now analyze AlgorithmĀ 5, which is composed of two main steps: in the first one it prepares the input data structures (lineĀ 5), invokes AlgorithmĀ 2, and initializes some data structures. In the second part (linesĀ 5ā5) it computes the final encoding and the LCP array from the structures computed at the previous step by iteratively applying AlgorithmĀ 4.
The complexity of the first part is essentially that of AlgorithmĀ 2, since computing the lists (lineĀ 5) requires with a single scan of the input data (whose size is ), while outputting the lists requires constant time per element.
The second step is mainly composed of a while loop that iteratively applies AlgorithmĀ 4 (that requires ) to compute the final interleave and the final LCP array. Moreover, the proof of correctness of AlgorithmĀ 5 also shows that AlgorithmĀ 4 is applied times, where is the largest value in the LCP array.
Finally, AlgorithmĀ 5 builds the final BWT from and the lists by a single scan of those -long lists, which requires time overall. Therefore, AlgorithmĀ 5 requires an overall time.
The I/O complexity of the first step is bits whereas the main memory requirement is bits. Indeed, computing the lists at lineĀ 5 requires us to store only one character per time of each sequence and to append it to the correct list: therefore it has bits I/O volume and bits main memory requirement. We have to include the requirements of AlgorithmĀ 2, which changes the main memory needed for the first step to bits.
The I/O volume of the second step is bits since it consists essentially of applications of AlgorithmĀ 4. Finally, while building the final BWT from , AlgorithmĀ 5 reads bits due to the interleave and bits due to the partial BWTs, writes bits for the final BWT and requires bits of main memory since at most it stores in main memory one element of and one element of a partial BWT.
Therefore, overall AlgorithmĀ 5 reads and writes bits from and to the disk and requires bits of main memory. We can summarize our results as follows.
Proposition 2**.**
Given as input a set composed of strings of length over and alphabet of size , the procedure BWT+LCP computes the BWT and the LCP array of it in time, where is the maximal value of the LCP array. This procedure requires to store in main memory bits and reads and writes from and to the disk bits.
Note that, if is constant then the time complexity of the method presented in this paper becomes . Moreover, if the word size is then its I/O volume and main memory requirement become and respectively.
7 Results
We implemented the method proposed in this article in a prototype in C, named bwt-lcp-em (we will refer to it as ble in the following) that is freely available at https://github.com/AlgoLab/bwt-lcp-em. We compared our method with other tools specifically designed to index datasets composed by a huge number of short sequences such as Next-Generation Sequencing read sets.
We have compared ble with the original implementation of extLCPĀ [22] (BEETL), as well as a more recent version (BEETL2) that implements a fully external memory approach111The second version is available at https://github.com/giovannarosone/BCR_LCP_GSA., the in-memory method gsa-isĀ [46], and two recent external memory tools (egsaĀ [30] and egapĀ [33]). Notice that gsa-is and egsa have been designed to compute the suffix array of a set of strings, so the computation of the BWT and of the LCP array is likely not optimized.
We used some non-default values for some of the parameters in order to minimize main memory usage: egap has been run with --lbytes 1, BEETL with --memory-limit=900, and egsa by compiling with MEMLIMIT=900. We allowed egap to use 95% of the available RAM, as suggested in its website.
We compared the considered tools in the scenario of 1GB of main memory available, by considering instances with 1, 2, 4, 8, 16, and 32 million sequences, taken from two different data sources: (a) 148bp Illumina reads from the Genome In A Bottle (GIAB)Ā [47] consortium, more precisely from the NA24385 individual; (b) random sequences of length generated by a Python script that builds uniformly distributed fixed-length sequences over the DNA alphabet (for the extended and detailed experimentation results we refer the read to the GitHub repository of the implementation). The goal of using these two datasets is to experimentally assess the theoretical time complexity of our approach that shows a dependency on , i.e., the maximum value of the LCP array (see SectionĀ 6). We expect that in the random datasets will be considerably less than the length of input strings. On the other hand, the Illumina datasets represent the worst case of our approach as they have a 300 mean coverage. Hence they will surely contain duplicate reads and will be equal to the length of the input strings.
We ran all the experiments on the same workstation running Ubuntu Linux 18.04 equipped with an Intel Core i7-4770 CPU running at 3.40GHz and a 256GB solid state disk. The machine is equipped with 8GB of RAM, and we limited the amount of RAM at boot time to 1GB to avoid the effects of OS caching.
TableĀ 1 reports the time (in minutes) required to compute the BWT and the LCP array. The first column indicates the number of sequences in the dataset, whereas column indicates the maximum value of the LCP array on that dataset. Symbol means that the tool could not complete the execution in our environment because it needed more than 1GB of RAM, while symbol means that the tool could not complete the execution since it required more disk space than that available. Notice that BEETL and gsa-is did not complete some executions because of the RAM limit, while egsa required more disk space than that available. To better highlight the trends, FigureĀ 8 visually depicts the same results presented in the table. We expect that this experiment will show the advantages of memory-conscious approaches.
The results point out that only ble, BEETL, BEETL2, and egap were able to deal with such a limited amount of main memory. Moreover, ble, BEETL2, and egap were able to compute the BWT and LCP array for all the datasets, while gsa-is and egsa could only cope with smaller datasets. Since gsa-is is an in-memory approach, its memory requirements made impossible to process even moderately large instances.
On the NA24385 dataset, egap is the fastest tool on the instances up to 8M reads, while BEETL2 is the fastest on larger instances. Still, the trend in the running time hints that ble will likely become the fastest tool on instances larger than those considered here.
On the random dataset, ble is the fastest tool on all instances with at least 2 million readsāegap is the fastest on 1 million reads. egap and ble are always the two fastest tools.
The comparison of the running times on the two datasets empirically confirms that ble has a time complexity that depends linearly on the maximum value of the LCP array, while BEETL and BEETL2 depend only on the maximum length of the input strings. As expected, also egsa and egap show a dependency on the maximum value of the LCP array, as both are definitely faster on the random dataset than on the NA24385 dataset (for egap roughly , for egsa from to ).
TableĀ 2, reports the RAM usage (in Megabytes) required to compute the BWT and the LCP array. Just as for TableĀ 1, the first column indicates the number of sequences in the datasets, whereas column indicates the maximum value of the LCP array on that dataset. As before, symbols and mean that the tool could not complete the execution in our environment since it exhausted the RAM or the available disk space, respectively. Notice that ble is always the tool requiring the smallest amount of memory, by a factor at least 6.
8 Conclusions
We have presented a new lightweight algorithm to compute the BWT and the LCP array of a set of strings, each characters long, based on applying a backward strategy for merging partial BWTs. More precisely, our algorithm has an time and I/O volume, and uses main memory to compute the BWT and LCP array, where is the maximum value in the LCP array. Our time complexity and I/O volume are in the worst case as those of the best previously available algorithms. The experimental analysis shows that our approach is competitive with the best available external-memory methods and that its advantage is noticeable on large inputs when the available RAM is limited.
The approach presented here may be further investigated in other research directions, as for example in the case of arbitrary alphabets and for collection of strings in other contexts, such as in dealing with dictionaries where the parameter may be smaller than the size of the input strings. Theoretically, it is of interest to investigate the open question whether the optimal time time can be achieved for computing the BWT. Some recent results (see, for example, [48]) may also suggest that running times of BWT-related algorithms could be improved on compressible inputs. Investigating if these results have an implication to our algorithm is an interesting research direction.
On the other end, the prototype called ble implementing our approach is still a proof of concept, although carefully developed, and future releases of the tool could improve its performance by, for example, a better buffering strategy of the input and output files, asynchronous I/O, or better representation (i.e., fast compression) of the intermediate files.
Acknowledgements
We would like to thank Giovanni Manzini for the discussion on the implementation and on the comparison with the software egap. This project has received funding from the European Unionās Horizon 2020 research and innovation programme under the Marie SkÅodowska-Curie grant agreement No 872539.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] J. Qin, et al., A human gut microbial gene catalogue established by metagenomic sequencing, Nature 464 (7285) (2010) 59ā65. doi:10.1038/nature 08821 . Ā· doiĀ ā
- 2[2] M. Burrows, D. J. Wheeler, A block-sorting lossless data compression algorithm, Tech. rep., Digital Systems Research Center (1994).
- 3[3] P. Ferragina, G. Manzini, Indexing compressed text, J. ACM 52 (4) (2005) 552ā581. doi:10.1145/1082036.1082039 . Ā· doiĀ ā
- 4[4] H. Li, Fast construction of FM-index for long sequence reads, Bioinformatics 30 (22) (2014) 3274ā3275. doi:10.1093/bioinformatics/btu 541 . Ā· doiĀ ā
- 5[5] G. Rosone, M. Sciortino, The BurrowsāWheeler transform between data compression and combinatorics on words, in: Ci E, Vol. 7921 of LNCS, 2013, pp. 353ā364. doi:10.1007/978-3-642-39053-1_42 . Ā· doiĀ ā
- 6[6] S. Mantaci, A. Restivo, G. Rosone, M. Sciortino, An extension of the BurrowsāWheeler Transform and applications to sequence comparison and data compression, in: CPM, Vol. 3537 of LNCS, 2005, pp. 178ā189. doi:10.1007/11496656_16 . Ā· doiĀ ā
- 7[7] S. Mantaci, A. Restivo, G. Rosone, M. Sciortino, An extension of the BurrowsāWheeler Transform, Theoretical Computer Science 387 (3) (2007) 298ā312. doi:10.1016/j.tcs.2007.07.014 . Ā· doiĀ ā
- 8[8] B. Langmead, C. Trapnell, M. Pop, S. Salzberg, Ultrafast and memory-efficient alignment of short DNA sequences to the human genome, Genome Biology 10 (3) (2009) R 25. doi:10.1186/gb-2009-10-3-r 25 . Ā· doiĀ ā
