Using Minimum Path Cover to Boost Dynamic Programming on DAGs: Co-Linear Chaining Extended
Anna Kuosmanen, Topi Paavilainen, Travis Gagie, Rayan Chikhi,, Alexandru I. Tomescu, Veli M\"akinen

TL;DR
This paper extends co-linear chaining to DAGs using minimum path cover, enabling efficient alignment of sequencing reads on complex genome graphs, with algorithms that outperform previous bounds when the path cover size is small.
Contribution
It introduces a novel algorithm for minimum path cover in DAGs and a general technique to extend sequence DP algorithms to DAGs, applied to co-linear chaining in genomics.
Findings
New algorithm for minimum path cover in DAGs with O(k|E|log|V|) complexity.
General method to extend sequence DP algorithms to DAGs.
Efficient practical implementation for genome graph alignment.
Abstract
Aligning sequencing reads on graph representations of genomes is an important ingredient of pan-genomics. Such approaches typically find a set of local anchors that indicate plausible matches between substrings of a read to subpaths of the graph. These anchor matches are then combined to form a (semi-local) alignment of the complete read on a subpath. Co-linear chaining is an algorithmically rigorous approach to combine the anchors. It is a well-known approach for the case of two sequences as inputs. Here we extend the approach so that one of the inputs can be a directed acyclic graph (DAGs), e.g. a splicing graph in transcriptomics or a variant graph in pan-genomics. This extension to DAGs turns out to have a tight connection to the minimum path cover problem, asking for a minimum-cardinality set of paths that cover all the nodes of a DAG. We study the case when the size of a…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAlgorithms and Data Compression · Genomics and Phylogenetic Studies · Genome Rearrangement Algorithms
11institutetext: Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Finland 22institutetext: Diego Portales University, Chile 33institutetext: CNRS, CRIStAL, University of Lille 1, France
Using Minimum Path Cover to Boost Dynamic Programming on DAGs: Co-Linear Chaining Extended
Anna Kuosmanen 11
Topi Paavilainen 11
Travis Gagie 22
Rayan Chikhi 33
Alexandru Tomescu*,⋆* 11
Veli Mäkinen*,* Shared last author contributionCorresponding author: [email protected]11
Abstract
Aligning sequencing reads on graph representations of genomes is an important ingredient of pan-genomics (Marschall et al. Briefings in Bioinformatics, 2016). Such approaches typically find a set of local anchors that indicate plausible matches between substrings of a read to subpaths of the graph. These anchor matches are then combined to form a (semi-local) alignment of the complete read on a subpath. Co-linear chaining is an algorithmically rigorous approach to combine the anchors. It is a well-known approach for the case of two sequences as inputs. Here we extend the approach so that one of the inputs can be a directed acyclic graph (DAGs), e.g. a splicing graph in transcriptomics or variant graph in pan-genomics.
The extension of co-linear chaining to DAGs turns out to have a tight connection to the minimum path cover problem that asks us to find a minimum-cardinality set of paths that cover all the nodes of a DAG. We study the case when the size of a minimum path cover is small, which is often the case in practice. First, we propose an algorithm for finding a minimum path cover of a DAG in time, improving all known time-bounds when is small and the DAG is not too dense. An immediate consequence is an improved space/time tradeoff for reachability queries in arbitrary directed graphs. Second, we introduce a general technique for extending dynamic programming (DP) algorithms from sequences to DAGs. This is enabled by our minimum path cover algorithm, and works by mimicking the DP algorithm for sequences on each path of the minimum path cover. This technique generally produces algorithms that are slower than their counterparts on sequences only by a factor . We illustrate this on two classical problems extended to labeled DAGs: longest increasing subsequence and longest common subsequence. For the former we obtain an algorithm with running time . This matches the optimal solution to the classical problem variant when, e.g., the input sequence is modeled as a path. We obtain an analogous result for the longest common subsequence problem. Finally, we apply this technique to the co-linear chaining problem, that is a generalization of both of the above two problems. The algorithm for this problem turns out to be more involved, needing further ingredients, such as an FM-index tailored for large alphabets, and a two-dimensional range search tree modified to support range maximum queries. We implemented the new co-linear chaining approach. Experiments on splicing graphs show that the new method is efficient also in practice.
1 Introduction
A path cover of a DAG is a set of paths such that every node of belongs to some path. A minimum path cover (MPC) is one having the minimum number of paths. The size of a MPC is also called the width of . Many DAGs commonly used in genome research, such as graphs encoding human mutations [9] and graphs modeling gene transcripts [18], can consist, in the former case, of millions of nodes and, in the latter case, of thousands of nodes. However, they generally have a small width on average; for example, splicing graphs for most genes in human chromosome 2 have width at most 10 [40, Fig. 7]. To the best of our knowledge, among the many MPC algorithms [15, 20, 35, 31, 7, 8], there are only three whose complexities depends on the width of the DAG. Say the width of is . The first algorithm runs in time and can be obtained by slightly modifying an algorithm for finding a minimum chain cover in partial orders from [12]. The other two algorithms are due to Chen and Chen: the first one works in time [7], and the second one works in time [8].
In this paper we present an MPC algorithm running in time . For example, for and , this is better than all previous algorithms. Our algorithm is based on the following standard reduction of a minimum flow problem to a maximum flow problem (see e.g. [2]): (i) find a feasible flow/path cover satisfying all demands, and (ii) solve a maximum flow problem in a graph encoding how much flow can be removed from every edge. Our main insight is to solve step (i) by finding an approximate solution that is greater than the optimal one only by a factor. Then, if we solve step (ii) with the Ford-Fulkerson algorithm, the number of iterations can be bounded by .
We then proceed to show that some problems (like pattern matching) that admit efficient sparse dynamic programming solutions on sequences [11] can be extended to DAGs, so that their complexity increases only by the minimum path cover size . Extending pattern matching to DAGs has been studied before [32, 3, 28]. For those edit distance -based formulations our approach does not yield an improvement, but on formulations involving sparse set of matching anchors [11] we can boost the naive solutions of their DAG extensions by exploiting a path cover. Namely, our improvement applies to many cases where a data structure over previously computed solutions is maintained and queried for computing the next value. Our new MPC algorithm enables this, as its complexity is generally of the same form as that of solving the extended problems. Given a path cover, our technique then computes so-called forward propagation links indicating how the partial solutions in each path in the cover must be synchronized.
To best illustrate the versatility of the technique itself, we show (in the Appendix) how to compute a longest increasing subsequence (LIS) in a labeled DAG, in time . This matches the optimal solution to the classical problem on a single sequence when, e.g., this is modeled as a path (where ). We also illustrate our technique with the longest common subsequence (LCS) problem between a labeled DAG and a sequence .
Finally, we consider the main problem of this paper—co-linear chaining (CLC)—first introduced in [27]. It has been proposed as a model of the sequence alignment problem that scales to massive inputs, and has been a subject of recent interest (see e.g. [33, 41, 43, 44, 45, 26, 36]). In the CLC problem, the input is directly assumed to be a set of pairs of intervals in the two sequences that match (either exactly or approximately). The CLC alignment solution asks for a subset of these plausible pairs that maximizes the coverage in one of the sequences, and whose elements appear in increasing order in both sequences. The fastest algorithm for this problem runs in the optimal time [1].
We define a generalization of the CLC problem between a sequence and a labeled DAG. As motivation, we mention the problem of aligning a long sequence, or even an entire chromosome, inside a DAG storing all known mutations of a population with respect to a reference genome (such as the above-mentioned [9], or more specificly a linearized version of it [17]). Here, the input pairs match intervals in the sequence with paths (also called anchors) in the DAG. This problem is not straightforward, as the topological order of the DAG might not follow the reachability order between the anchors.
Existing tools for aligning DNA sequences to DAGs (BGREAT [24], vg [29]) rely on anchors but do not explicitly consider solving CLC optimally on the DAG.
The algorithm we propose uses the general framework mentioned above. Since it is more involved, we will develop it in stages. We first give a simple approach to solve a relaxed co-linear chaining problem using time, then introduce the MPC approach that requires time. As above, if the DAG is a labeled path representing a sequence, the running time of our algorithm is reduced to the best current solution for the co-linear chaining problem on sequences, [1]. We conclude (in the Appendix) with a Burrows-Wheeler technique to efficiently handle a special case that we omitted in this relaxed variant. We remark that one can reduce the LIS and LCS problems to the CLC problem to obtain the same running time bounds as mentioned earlier; these are given for the sake of comprehensiveness.
In the last section we discuss the anchor-finding preprocessing step. We implemented the new MPC-based co-linear chaining algorithm and conducted experiments on splicing graphs to show that the approach is practical, once anchors are given. Some future directions on how to incorporate practical anchors, and how to apply the techniques to transcript prediction, are discussed.
Notation.
To simplify notation, for any DAG we will assume that is always and that is a topological order on (so that for every edge we have ). We will also assume that . A labeled DAG is a tuple where is a DAG and assign to the nodes labels from , being an ordered alphabet.
For a node , we denote by the set of in-neighbors of and by the set of out-neighbors of . If there is a (possibly empty) path from node to node we say that reaches . We denote by the set of nodes that reach . We denote a set of consecutive integers with interval notation , meaning . For a pair of intervals , we use , , , and to denote the four respective endpoints. We also consider pairs of the form where is a path, and use to access . The first node of will be called its startpoint, and its last node will be called its endpoint. For a set we may fix an order, to access an element as .
2 The MPC algorithm
In this section we assume basic familiarity with network flow concepts; see [2] for further details. In the minimum flow problem, we are given a directed graph with a single source and a single sink, with a demand for every edge. The task is to find a flow of minimum value (the value is the sum of the flow on the edges exiting the source) that satisfies all demands (to be called feasible). The standard reduction from the minimum path cover problem to a minimum flow one (see, e.g. [30]) creates a new DAG by replacing each node with two nodes , adds the edge and adds all in-neighbors of as in-neighbors of , and all out-neighbors of as out-neighbors of . Finally, the reduction adds a global source with an out-going edge to every node, and a global sink with an in-coming edge from every node. Edges of type get demand , and all other edges get demand [math]. The value of the minimum flow equals , the width of , and any decomposition of it into source-to-sink paths induces a minimum path cover in .
Our MPC algorithm is based on the following simple reduction of a minimum flow problem to a maximum flow one (see e.g. [2]): (i) find a feasible flow ; (ii) transform this into a minimum feasible flow, by finding a maximum flow in in which every now has capacity . The final minimum flow solution is obtained as , for every . Observe that this path cover induces a flow of value . Thus, in step (ii) we need to shrink this flow into a flow of value . If we run the Ford-Fulkerson algorithm, this means that there are successive augmenting paths, each of which can be found in time . This gives a time bound for step (ii) of .
We solve step (i) in time by finding a path cover in whose size is larger than only by a relative factor . This is based on the classical greedy set cover algorithm, see e.g. [42, Chapter 2]: at each step, select a path covering most of the remaining uncovered nodes.
This approach is similar to the one from [12] for finding the minimum number of chains to cover a partial order of size . A chain is a set of pairwise comparable elements. The algorithm from [12] runs in time , and it has the same feature as ours: it first finds a set of chains in the same way as us (longest chains covering most uncovered elements), and then in a second step reduces these to . However, if we were to apply this algorithm to DAGs, it would run in time , which is slower than our algorithm for small . This is because it uses the classical reduction given by Fulkerson [15] to a bipartite graph, where each edge of the graph encodes a pair of elements in the relation. Since DAGs are not transitive in general, to use this reduction one needs first to compute the transitive closure of the DAG, in time . Such approximation-refinement approach has also been applied to other covering problems on graphs, such as a 2-hop cover [10].
We now show how to solve step (i) within the claimed running time, by dynamic programming.
Lemma 1
Let be a DAG, and let be the width of . In time , we can compute a path cover of , such that .
Proof.
The algorithm works by choosing, at each step, a path that covers the most uncovered nodes. For every node , we store , if is not covered by any path, and otherwise. We also store as the largest number of uncovered nodes on a path starting at . The values are computed by dynamic programming, by traversing the nodes in inverse topological order and setting . Initially we have for all . We then compute for all , in time . By taking the node with the maximum , and tracing back along the optimal path starting at , we obtain our first path in time . We then update for all nodes on this path, and iterate this process until all nodes are covered. This takes overall time , where is the number of paths found.
This algorithm analysis is identical to the one of the classical greedy set cover algorithm [42, Chapter 2], because the universe to be covered is and each possible path in is a possible covering set, which implies that . ∎∎
Combining Lemma 1 with the above-mentioned application of the Ford-Fulkerson algorithm, we obtain our first result:
Theorem 2.1
Given a DAG of width , the MPC problem on can be solved in time .
3 The dynamic programming framework
In this section we give an overview of the main ideas of our approach.
Suppose we have a problem involving DAGs that is solvable, for example by dynamic programming, by traversing the nodes in topological order. Thus, assume also that a partial solution at each node is obtainable from all (and only) nodes of the DAG that can reach , plus some other independent objects, such as another sequence. Furthermore, suppose that at each node we need to query (and maintain) a data structure that depends on and such that the answer at is decomposable as:
[TABLE]
In the above, the sets are such that , they are not necessarily disjoint, and is some operation on the queries, such as min or max, that does not assume disjointness. It is understood that after the computation at , we need to update . It is also understood that once we have updated at , we cannot query for a node before in topological order, because it would give an incorrect answer.
The first idea is to decompose the graph into a path cover . As such, we decompose the computation only along these paths, in light of (1). We replace a single data structure with data structures , and perform the operation from (1) on the results of the queries to these data structures.
Our second idea concerns the order in which the nodes on these paths are processed. Because the answer at depends on , we cannot process the nodes on the paths (and update the corresponding ’s) in an arbitrary order. As such, for every path and every node , we distinguish the last node on path that reaches (if it exists). We will call this node . See Figure 1 for an example. We note that this insight is the same as in [21], which symmetrically identified the first node on a chain that can be reached from (a chain is a subsequence of a path). The following observation is the first ingredient for using the decomposition (1).
Observation 1
Let be a path cover of a DAG , and let . Let denote the set of nodes of from its beginning until inclusively (or the empty set, if does not exist). Then .
Proof.
It is clear that . To show the reverse inclusion, consider a node . Since is a path cover, then appears on some . Since reaches , then appears on before , or . Therefore appears on , as desired. ∎∎
This allows us to identify, for every node , a set of forward propagation links , where holds for any node and index with . These propagation links are the second ingredient in the correctness of the decomposition. Once we have computed the correct value at , we update the corresponding data structures for all paths to which belongs. We also propagate the query value of in the decomposition (1) for all nodes with . This means that when we come to process , we have already correctly computed all terms in the decomposition (1) and it suffices to apply the operation to these terms.
The next lemma shows how to compute the values (and, as a consequence, all forward propagation links), also by dynamic programming.
Lemma 2
Let be a DAG, and let be a path cover of . For every and every , we can compute in overall time .
Proof.
For each and every node on , let denote the position of in (say, starting from ). Our algorithm actually computes as the index of this node in . Initially, we set for all and . At the end of the algorithm, will hold precisely for those nodes that cannot be reached by any node of . We traverse the nodes in topological order. For every , we do as follows: if is on , then we set . Otherwise, we compute by dynamic programming as . ∎∎
An immediate application of Theorem 2.1 and of the values is for solving reachability queries (see Appendix 0.A.1). Another simple application is an extension of the longest increasing subsequence (LIS) problem to labeled DAGs (Appendix 0.A.2).
The LIS problem, the LCS problem of Section 4, as well as the CLC problem of Section 5 make use of the following standard data structure (see e.g. [25, p.20]).
Lemma 3
The following two operations can be supported with a balanced binary search tree in time , where is the number of leaves in the tree.
- •
: For the leaf with , update .
- •
: Return (Range Maximum Query).
Moreover, the balanced binary search tree can be built in time, given the pairs sorted by component .
4 The LCS problem
Consider a labeled DAG and a sequence , where is an ordered alphabet. We say that the longest common subsequence (LCS) between and is a longest subsequence of any path label in such that is also a subsequence of .
We will modify the LIS algorithm of Appendix 0.A.2 minimally to find a LCS between a DAG and a sequence . The description is self-contained yet, for the interest of page limit, more dense than the LIS algorithm derivation. The purpose is to give an example of the general MPC-framework with fewer technical details than required in the main result of this paper concerning co-linear chaining.
For any , let denote set . For each node and each , we aim to store in the length of the longest common subsequence between and any label of path ending at , among all subsequences having as the last symbol.
Assume we have a path cover of size and computed for all . Assume also we have mapped to in time (e.g. by sorting the symbols of , binary searching labels of , and then relabeling by ranks, with the exception that, if a node label does not appear in , it is replaced by ).
Let be a search tree of Lemma 3 initialized with key-value pairs , , , …, , for each . The algorithm proceeds in fixed topological ordering on . At a node , for every we now update an array for all as follows: . The update step of when the algorithm reaches a node , for each covering path containing , is done as for all with and . Initialization is handled by the key-value pair so that any with can start a new common subsequence.
The final answer to the problem is , with the actual LCS to be found with a standard traceback. The algorithm runs in time, where , and assuming a cover of paths is given. Notice that can be . With Theorem 2.1 plugged in, the total running time becomes . Since the queries on the data structures are semi-open, one can use the more efficient data structure from [16] to improve the bound to . The following theorem summarizes this result.
Theorem 4.1
Let be a labeled DAG of width , and let , where is an ordered alphabet. We can find a longest common subsequence between and in time .
When is a path, the bound improves to , which nearly matches the fastest sparse dynamic programming algorithm for the LCS on two sequences [11] (with a difference in -factor due to a different data structure, which does not work for this order of computation).
5 Co-linear chaining
We start with a formal definition of the co-linear chaining problem (see Figure 2 for an illustration), following the notions introduced in [25, Section 15.4].
Problem 1 (Co-linear chaining (CLC))
Let and be two sequences over an alphabet , and let be a set of pairs . Find an ordered subset of pairs from such that
- •
and , for all , and
- •
maximizes the ordered coverage of , defined as
[TABLE]
The definition of ordered coverage between two sequences is symmetric, as we can simply exchange the roles of and . But when solving the CLC problem between a DAG and a sequence, we must choose whether we want to maximize the ordered coverage on the sequence or on the DAG . We will consider the former variant.
First, we define the following precedence relation:
Definition 1
Given two paths and in a DAG , we say that precedes , and write , if one of the following conditions holds:
- •
and do not share nodes and there is a path in from the endpoint of to the startpoint of , or
- •
and have a suffix-prefix overlap and is not fully contained in ; that is, if and then there exists a such that , , …, .
We then extend the formulation of Problem 1 to handle a sequence and a DAG.
Problem 2 (CLC between a sequence and a DAG)
Let be a sequence, let be a labeled DAG, and let be a set of pairs , where is a path in and are non-negative integers. Find an ordered subset of pairs from such that
- •
for all , it holds that and , and
- •
maximizes the ordered coverage of , analogously defined as .
To illustrate the main technique of this paper, let us for now only seek solutions where paths in consecutive pairs in a solution do not overlap in the DAG. Suffix-prefix overlaps between paths turn out to be challenging; we will postpone this case until Appendix 0.B.
Problem 3 (Overlap-limited CLC between a sequence and a DAG)
Let be a sequence, let be a labeled DAG, and let be a set of pairs , where is a path in and are non-negative integers (with the interpretation that matches ). Find an ordered subset of pairs from such that
- •
for all , it holds that there is a non-empty path from the last node of to the first node of and , and
- •
maximizes .
First, let us consider a trivial approach to solve Problem 3. Assume we have ordered in time the input pairs as , so that the endpoints of are in topological order, breaking ties arbitrarily. We denote by the maximum ordered coverage of using the pair and any subset of pairs from .
Theorem 5.1
Overlap-limited co-linear chaining between a sequence and a labeled DAG (Problem 3) on input pairs can be solved in time.
Proof.
First, we reverse the edges of . Then we mark the nodes that correspond to the path endpoints for every pair. After this preprocessing we can start computing the maximum ordered coverage for the pairs as follows: for every pair in topological order of their path endpoints for we do a depth-first traversal starting at the startpoint of path . Note that since the edges are reversed, the depth-first traversal checks only pairs whose paths are predecessors of .
Whenever we encounter a node that corresponds to the path endpoint of a pair , we first examine whether it fulfills the criterion (call this case (a)). The best ordered coverage using pair after all such is then
[TABLE]
where is the best ordered coverage when using pairs last.
If pair does not fulfill the criterion for case (a), we then check whether (call this case (b)). The best ordered coverage using pair after all such with is then
[TABLE]
Inclusions, i.e. , can be left computed incorrectly in , since there is a better or equally good solution computed in or that does not use them [1].
Finally, we take . Depth-first traversal takes time and is executed times, for total time. ∎∎
However, we can do significantly better than time. In the next sections we will describe how to apply the framework from Section 3 here.
5.1 Co-linear chaining on sequences revisited
We now describe the dynamic programming algorithm from [1] for the case of two sequences, as we will then reuse this same algorithm in our MPC approach.
First, sort input pairs in by the coordinate into the sequence , , …, , so that holds for all . This will ensure that we consider the overlapping ranges in sequence in the correct order. Then, we fill a table analogous to that of Theorem 5.1 so that gives the maximum ordered coverage of using the pair and any subset of pairs from . Hence, gives the total maximum ordered coverage of .
Consider Equations (2) and (3). Now we can use an invariant technique to convert these recurrence relations so that we can exploit the range maximum queries of Lemma 3:
[TABLE]
For these to work correctly, we need to have properly updated the trees and for all . That is, we need to call and after computing each . The running time is .
Figure 2 illustrates the optimal chain on our schematic example. This chain can be extracted by modifying the algorithm to store traceback pointers.
Theorem 5.2 ([36, 1])
Problem 1 on input pairs can be solved in the optimal time.
5.2 Co-linear chaining on DAGs using a minimum path cover
Let us now modify the above algorithm to work with DAGs, using the main technique of this paper.
Theorem 5.3
Problem 3 on a labeled DAG of width and a set of input pairs can be solved in time time.
Proof.
Assume we have a path cover of size and computed for all . For each path , we create two binary search trees and . As a reminder, these trees correspond to coverages for pairs that do not, and do overlap, respectively, on the sequence. Moreover, recall that in Problem 3 we do not consider solutions where consecutive paths in the graph overlap.
As keys, we use , for every pair , and additionally the key 0. The value of every key is initialized to .
After these preprocessing steps, we process the nodes in topological order, as detailed in Algorithm 1. If node corresponds to the endpoint of some , we update the trees and for all covering paths containing node . Then we follow all forward propagation links and update for each path starting at , taking into account all pairs whose path endpoints are in covering path . Before the main loop visits , we have processed all forward propagation links to , and the computation of has taken all previous pairs into account, as in the naive algorithm, but now indirectly through the search trees. Exceptions are the pairs overlapping in the graph, which we omit in this problem statement. The forward propagation ensures that the search tree query results are indeed taking only reachable pairs into account. While is already computed when visiting , the startpoint of , the added coverage with the pair is updated to the search trees only when visiting the endpoint.
There are forward propagation links, and both search trees are queried in time. All the search trees containing a path endpoint of a pair are updated. Each endpoint can be contained in at most paths, so this also gives the same bound on the number of updates. With Theorem 2.1 plugged in, we have and the total running time becomes . ∎∎
Appendix 0.B shows how to handle the case of path overlaps, giving the following result:
Theorem 5.4
Let be a labeled DAG and let be a set of pairs of the form . The algorithms from Theorems 5.1 and 5.3 can be modified to solve Problem 2 with additional time or , where is at most the input length and is the number of overlaps between the input paths.
The bound comes as a direct consequence of using a generalized suffix tree to compute the overlaps in advance [34, proof of Theorem 2]. With the overlaps given, one can process each in constant time to see if they give the maximum for . The other bound comes from backward searching a subpath in a concatenation of all subpaths using FM-index. At each step a range can be identified that contains all subpaths with suffix-prefix overlap with the current one. This range limits the keys suitable in the binary search trees storing the already computed coverage values, and thus a two-dimensional range search is needed (see Appendix 0.B).
6 Discussion and experiments
For applying our solutions to Problem 2 in practice, one first needs to find the alignment anchors. As explained in the problem formulation, alignment anchors are such pairs where is a path in and matches . With sequence inputs, such pairs are usually taken to be maximal exact matches (MEMs) and can be retrieved in small space in linear time [5, 4]. It is largely an open problem how to retrieve MEMs between a sequence and a DAG efficiently: The case of length-limited MEMs is studied in [37], based on an extension of [38] with features such as suffix tree functionality. On the practical side, anchor finding has already been incorporated into tools for conducting alignment of a sequence to a DAG [24, 29].
For the purpose of demonstrating the efficiency of our MPC-approach applied to co-linear chaining, we implemented a MEM-finding routine based on simple dynamic programming. We leave it for future work to incorporate a practical procedure (e.g. like those in [24, 29]). We tested the time improvement of our MPC-approach (Theorem 5.3) over the trivial algorithm (Theorem 5.1) on the sequence graphs of annotated human genes. Out of all the 62219 genes in the HG38 annotation for all human chromosomes, we singled out 8628 genes such that their sequence graph had at least 5000 nodes. Out of these, we picked 500 genes at random.
The size of the graphs for these 500 genes varied between and vertices. Their width, i.e., the number of paths in the MPC, varied between and . (The number of graphs for each value of is listed in the column #graphs of the top table of Figure 3.) The number of anchors, , for patterns of length 1000 varied between and . As shown in Figure 3, with small values of , our MPC-based co-linear chaining algorithm was twice as fast as the trivial algorithm. When values of were increased from to , the difference increased to two orders of magnitude.
The improved efficiency when compared to the naive approach gives reason to believe a practical sequence-to-DAG aligner can be engineered along the algorithmic foundations given here. Future work includes the incorporation of a practical anchor-finding method, and testing whether the complete scheme improves transcript prediction through improved finding of exon chains [23].
On the theoretical side, it remains open whether the MPC algorithm could benefit from a better initial approximation and/or one that is faster to compute. More generally, it remains open whether the overall bound for the MPC problem can be improved.
Acknowledgements.
We thank Djamal Belazzougui for pointers on backward step on large alphabet and Gonzalo Navarro for pointing out the connection to pattern matching on hypertexts. This work was funded in part by the Academy of Finland (grant 274977 to AIT and grants 284598 and 309048 to AK and to VM), and by Futurice Oy (to TP).
Appendix 0.A Two simple applications
0.A.1 Reachability queries
An immediate application of Theorem 2.1 and of the values is for solving reachability queries. If we have all these values, then we can answer in constant time whether a node is reachable from a node , as in [21]: we check , where was defined in the proof of Lemma 2, is a path containing , and we take by convention . Recall also that reachability queries in an arbitrary graph can be reduced to solving reachability queries in its DAG of strongly connected components, because nodes in the same component are pairwise reachable. See Table 1 for existing tradeoffs for solving reachability queries.111Note that [7] incorrectly attributes to [21] query time , and as a consequence [39, 22] incorrectly mention query time for [7].
Corollary 1
Let be an arbitrary directed graph and let the width of its DAG of strongly connected components be . In time we can construct from an index of size , so that for any we can answer in time whether is reachable from .
0.A.2 The LIS problem
The LIS problem asks us to delete the minimum number of values from an input sequence such that remaining values form a strictly increasing series of values. Here the input sequence is assumed to come from an ordered alphabet . For example, on input sequence , from the alphabet , the unique optimal solution is . Such a longest increasing subsequence can be found in the optimal time [14].
This optimal algorithm works as follows. We first map to a subset of with an order-preserving mapping, in time (by e.g., sorting the sequence elements, and relabeling by the ranks in the order of distinct values). We then store, at every index of the input sequence, the value defined as the length of the longest strictly increasing subsequence ending at and using the -th symbol. The values can be computed by dynamic programming, by storing all previous key-value pairs in a search tree as in Lemma 3, and querying .
Consider the following extension of the LIS problem to a labeled DAG of width . For a path in , let the label of , denoted , be the concatenation of the labels of the nodes of , namely . Among all paths in , and among all subsequences of , we need to find a longest strictly increasing subsequence.
We now explain how to extend the previous dynamic programming algorithm for this problem. We analogously map to a subset of with an order-preserving mapping in time, as above. Recall that we assume , where is a topological order. Assume also that we have paths to cover and is computed for all .
For each node , we aim to analogously compute as the length of a longest strictly increasing subsequence of the labels of all paths ending at , with the property that is the last element of this subsequence.
For each , we let be a search tree as in Lemma 3, initialized with key-value pairs . The algorithm proceeds in the fixed topological ordering. Assume now that we are at some position , and have already updated all search trees associated with the covering paths going through . For every , we update . Once the algorithm reaches in the topological ordering, value has been updated from all such that . It remains to show how to update each when reaching , for all covering paths on which occurs. This is done as . Initialization is handled by the key-value pair so that any position can start a new increasing subsequence. Figure 4 shows an example.
The final answer to the problem is , with the actual LIS to be found with a standard traceback. The algorithm runs in time. With Theorem 2.1 plugged in, we have and the total running time becomes , under our assumption . The following theorem summarizes this result.
Theorem 0.A.1
Let be a labeled DAG of width , where is an ordered alphabet. We can find a longest increasing subsequence in in time .
When the DAG is just a labeled path with (modeling the standard LIS problem), then the algorithm from Lemma 1 returns one path (). The complexity is then , matching the best possible bound for the standard LIS problem [14].
Appendix 0.B Co-linear chaining with path overlaps
We now consider how to extend the algorithms we developed for Problem 3 to work for the more general case of Problem 2, where overlaps between paths are allowed in a solution. The detection and merging of such path overlaps has been studied in [34], and we tailor a similar approach for our purposes.
We use an FM-index [13] tailored for large alphabets [19], and a two-dimensional range search tree [6] modified to support range maximum queries. The former is used for obtaining all ranges in the coverage array such that all input pairs have a path , , overlapping with the path of -th input pair . Here, the endpoint of the -th input pair is at node visited in topological order. This implies that the paths of the input pairs have already been visited, and thus, by induction, that values have been correctly computed (subject to the modification we are about to study). The sequence ranges for all may be arbitrarily located with respect to interval , so we need to maintain an analogous mechanism with search trees of type and as in our co-linear chaining algorithm based on a path cover. This time we cannot, in advance, separate the input pairs to paths with different search trees, but we have a dynamic setting, with interval deciding which values should be taken into account. This is where a two-dimensional range search tree is used to support these queries in time: Figure 5 illustrates this.
In what follows we show that queries are sufficient to take all overlaps into account throughout the algorithm execution (holding for both the trivial algorithm and for the one based on a path cover), where —the sum of the path lengths—is at most the total input length. The construction will actually induce an order for the input pairs such that queries are sufficient: Since the other parts of the algorithms do not use the order of input pairs directly, we can safely reorganize the input accordingly.
With this introduction, we are ready to consider how all the intervals related to -th pair are obtained. We build in time the FM-index version proposed in [19] of sequences , where is a symbol not in alphabet and considered smaller than other symbols, e.g. , and denotes the reverse of .
For our purposes it is sufficient to know that the FM-index of , when given an interval corresponding to lexicographically-ordered suffixes that start with , can determine the interval in time [19]. This operation is called backward step.
We use the index to search in the forward direction by searching its reverse with backward steps. Consider we have found interval , for some , such that backward step results in a non-empty interval . This interval corresponds to all suffixes of that have as a prefix. That is, corresponds to input pairs whose path suffix have a length overlap with the path prefix of -th input pair. For this interval to match with coverage array , we just need to rearrange the input pairs according to their order in the first rows of the array storing the lexicographic order of suffixes of .
Since each backward step on the index may induce a range search on exactly one interval , the running time is dominated by the range queries.222A simple wavelet tree based FM-index would provide the same bound, but in case the range search part is later improved, we used the best bound for the subroutine. On the other hand, this also gives the bound on the number of range queries, as claimed earlier.
Alternatively, one can omit the expensive range queries and process each overlapping pair separately, to compute in constant time its contribution to . This gives another bound , where is the number of overlaps between the input paths. This can be improved to by using a generalized suffix tree to compute the overlaps in advance [34, proof of Theorem 2].
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] Mohamed Ibrahim Abouelhoda. A Chaining Algorithm for Mapping c DNA Sequences to Multiple Genomic Sequences. In 14th International Symposium on String Processing and Information Retrieval , volume 4726 of LNCS , pages 1–13. Springer, 2007.
- 2[2] Ravindra K. Ahuja, Thomas L. Magnanti, and James B. Orlin. Network Flows: Theory, Algorithms, and Applications . Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1993.
- 3[3] Amihood Amir, Moshe Lewenstein, and Noa Lewenstein. Pattern matching in hypertext. J. Algorithms , 35(1):82–99, 2000.
- 4[4] Djamal Belazzougui. Linear time construction of compressed text indices in compact space. In Proc. Symposium on Theory of Computing STOC 2014 , pages 148–193. ACM, 2014.
- 5[5] Djamal Belazzougui, Fabio Cunial, Juha Kärkkäinen, and Veli Mäkinen. Versatile Succinct Representations of the Bidirectional Burrows-Wheeler Transform. In Proc. 21st Annual European Symposium on Algorithms (ESA 2013) , volume 8125 of LNCS , pages 133–144. Springer, 2013.
- 6[6] Mark de Berg, Otfried Cheong, Marc van Kreveld, and Mark Overmars. Computational Geometry: Algorithms and Applications . Springer-Verlag TELOS, Santa Clara, CA, USA, 3rd ed. edition, 2008.
- 7[7] Y. Chen and Y. Chen. An Efficient Algorithm for Answering Graph Reachability Queries. In 2008 IEEE 24th International Conference on Data Engineering , pages 893–902, April 2008.
- 8[8] Y. Chen and Y. Chen. On the graph decomposition. In 2014 IEEE Fourth International Conference on Big Data and Cloud Computing , pages 777–784, Dec 2014.
