Using Minimum Path Cover to Boost Dynamic Programming on DAGs: Co-Linear Chaining Extended

Anna Kuosmanen; Topi Paavilainen; Travis Gagie; Rayan Chikhi; Alexandru I. Tomescu; Veli Mäkinen

arXiv:1705.08754·cs.DS·January 30, 2018


TL;DR

This paper extends co-linear chaining to DAGs using minimum path cover, enabling efficient alignment of sequencing reads on complex genome graphs, with algorithms that outperform previous bounds when the path cover size is small.

Contribution

It introduces a novel algorithm for minimum path cover in DAGs and a general technique to extend sequence DP algorithms to DAGs, applied to co-linear chaining in genomics.

Findings

1. New algorithm for minimum path cover in DAGs running in $O(k|E|\log|V|)$ time.

2. General method to extend sequence DP algorithms to DAGs.

3. Efficient practical implementation for genome graph alignment.

Abstract

Aligning sequencing reads on graph representations of genomes is an important ingredient of pan-genomics. Such approaches typically find a set of local anchors that indicate plausible matches between substrings of a read to subpaths of the graph. These anchor matches are then combined to form a (semi-local) alignment of the complete read on a subpath. Co-linear chaining is an algorithmically rigorous approach to combine the anchors. It is a well-known approach for the case of two sequences as inputs. Here we extend the approach so that one of the inputs can be a directed acyclic graph (DAG), e.g. a splicing graph in transcriptomics or a variant graph in pan-genomics. This extension to DAGs turns out to have a tight connection to the minimum path cover problem, asking for a minimum-cardinality set of paths that cover all the nodes of a DAG. We study the case when the size $k$ of a…

Tables

Table 1: Previous comparable space/time tradeoffs for solving reachability queries. Compiled from [39, Table 1].

Construction time	Index size	Query time	Reference
$O(k|E|\log|V|)$	$O(k|V|)$	$O(1)$	this paper
$O(|V|^{2}+k\sqrt{k}|V|)$	$O(k|V|)$	$O(1)$	[7]
$O(k|E|)$ or $O(|V||E|)$	$O(k|V|)$	$O(\log^{2}k)$	[22]
$O(k(|V|+|E|))$	$O(k|V|)$	$O(k)$ or $O(|V|+|E|)$	[46]

Equations

$\mathsf{Query}(R^{-}(v))=\bigoplus_{i}\mathsf{Query}(R^{-}_{i}(v))$

$\mathtt{coverage}(R,S)=|\{i\in[1..|R|]\>|\>i\in[s_{j}.c..s_{j}.d]\text{ for some }1\leq j\leq p\}|$

$C^{\textrm{a}}[j]=\max_{j^{\prime}\,:\,M[j^{\prime}].d<M[j].c}\{C[j^{\prime}]+(M[j].d-M[j].c+1)\}$

$C^{\textrm{b}}[j]=\max_{j^{\prime}\,:\,M[j].c\leq M[j^{\prime}].d\leq M[j].d}\{C[j^{\prime}]+(M[j].d-M[j^{\prime}].d)\}$


Code & Models

Repositories

Anna-Kuosmanen/DAGChainer


Taxonomy

Topics: Algorithms and Data Compression · Genomics and Phylogenetic Studies · Genome Rearrangement Algorithms

Full text

Using Minimum Path Cover to Boost Dynamic Programming on DAGs: Co-Linear Chaining Extended

Anna Kuosmanen 1, Topi Paavilainen 1, Travis Gagie 2, Rayan Chikhi 3, Alexandru Tomescu 1,*,⋆, Veli Mäkinen 1,*

1 Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Finland

2 Diego Portales University, Chile

3 CNRS, CRIStAL, University of Lille 1, France

* Shared last author contribution

⋆ Corresponding author: [email protected]

Abstract

Aligning sequencing reads on graph representations of genomes is an important ingredient of pan-genomics (Marschall et al. Briefings in Bioinformatics, 2016). Such approaches typically find a set of local anchors that indicate plausible matches between substrings of a read to subpaths of the graph. These anchor matches are then combined to form a (semi-local) alignment of the complete read on a subpath. Co-linear chaining is an algorithmically rigorous approach to combine the anchors. It is a well-known approach for the case of two sequences as inputs. Here we extend the approach so that one of the inputs can be a directed acyclic graph (DAG), e.g. a splicing graph in transcriptomics or a variant graph in pan-genomics.

The extension of co-linear chaining to DAGs turns out to have a tight connection to the minimum path cover problem that asks us to find a minimum-cardinality set of paths that cover all the nodes of a DAG. We study the case when the size $k$ of a minimum path cover is small, which is often the case in practice. First, we propose an algorithm for finding a minimum path cover of a DAG $(V,E)$ in $O(k|E|\log|V|)$ time, improving all known time-bounds when $k$ is small and the DAG is not too dense. An immediate consequence is an improved space/time tradeoff for reachability queries in arbitrary directed graphs. Second, we introduce a general technique for extending dynamic programming (DP) algorithms from sequences to DAGs. This is enabled by our minimum path cover algorithm, and works by mimicking the DP algorithm for sequences on each path of the minimum path cover. This technique generally produces algorithms that are slower than their counterparts on sequences only by a factor $k$ . We illustrate this on two classical problems extended to labeled DAGs: longest increasing subsequence and longest common subsequence. For the former we obtain an algorithm with running time $O(k|E|\log|V|)$ . This matches the optimal solution to the classical problem variant when, e.g., the input sequence is modeled as a path. We obtain an analogous result for the longest common subsequence problem. Finally, we apply this technique to the co-linear chaining problem, which generalizes both of the above problems. The algorithm for this problem turns out to be more involved, needing further ingredients, such as an FM-index tailored for large alphabets, and a two-dimensional range search tree modified to support range maximum queries. We implemented the new co-linear chaining approach. Experiments on splicing graphs show that the new method is also efficient in practice.

1 Introduction

A path cover of a DAG $G=(V,E)$ is a set of paths such that every node of $G$ belongs to some path. A minimum path cover (MPC) is one having the minimum number of paths. The size of an MPC is also called the width of $G$ . Many DAGs commonly used in genome research, such as graphs encoding human mutations [9] and graphs modeling gene transcripts [18], can consist, in the former case, of millions of nodes and, in the latter case, of thousands of nodes. However, they generally have a small width on average; for example, splicing graphs for most genes in human chromosome 2 have width at most 10 [40, Fig. 7]. To the best of our knowledge, among the many MPC algorithms [15, 20, 35, 31, 7, 8], there are only three whose complexities depend on the width of the DAG. Say the width of $G$ is $k$ . The first algorithm runs in time $O(|V||E|+k|V|^{2})$ and can be obtained by slightly modifying an algorithm for finding a minimum chain cover in partial orders from [12]. The other two algorithms are due to Chen and Chen: the first one works in time $O(|V|^{2}+k\sqrt{k}|V|)$ [7], and the second one works in time $O(\max(\sqrt{|V|}|E|,k\sqrt{k}|V|))$ [8].

In this paper we present an MPC algorithm running in time $O(k|E|\log|V|)$ . For example, for $k=o(\sqrt{|V|}/\log|V|)$ and $|E|=O(|V|^{3/2})$ , this is better than all previous algorithms. Our algorithm is based on the following standard reduction of a minimum flow problem to a maximum flow problem (see e.g. [2]): (i) find a feasible flow/path cover satisfying all demands, and (ii) solve a maximum flow problem in a graph encoding how much flow can be removed from every edge. Our main insight is to solve step (i) by finding an approximate solution that is greater than the optimal one only by an $O(\log|V|)$ factor. Then, if we solve step (ii) with the Ford-Fulkerson algorithm, the number of iterations can be bounded by $O(k\log|V|)$ .
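The claim that the new bound beats the three width-sensitive bounds in the stated regime can be checked directly. The short derivation below is a sketch that uses only the two stated assumptions on $k$ and $|E|$, together with the convention $|E|\geq|V|-1$ from the Notation paragraph.

```latex
% Assumptions: k = o(\sqrt{|V|}/\log|V|),  |E| = O(|V|^{3/2}),  |E| \ge |V|-1.
\begin{align*}
k|E|\log|V| &= o\!\left(\tfrac{\sqrt{|V|}}{\log|V|}\right)\cdot O\!\left(|V|^{3/2}\right)\cdot\log|V| = o\!\left(|V|^{2}\right),\\
|V||E|+k|V|^{2} &\ge |V|\,(|V|-1) = \Omega\!\left(|V|^{2}\right),\\
|V|^{2}+k\sqrt{k}\,|V| &= \Omega\!\left(|V|^{2}\right),\\
\max\!\left(\sqrt{|V|}\,|E|,\;k\sqrt{k}\,|V|\right) &\ge \sqrt{|V|}\,|E| = \omega\!\left(k|E|\log|V|\right)
\quad\text{since } k\log|V| = o\!\left(\sqrt{|V|}\right).
\end{align*}
```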

We then proceed to show that some problems (like pattern matching) that admit efficient sparse dynamic programming solutions on sequences [11] can be extended to DAGs, so that their complexity increases only by a factor equal to the minimum path cover size $k$ . Extending pattern matching to DAGs has been studied before [32, 3, 28]. For those edit-distance-based formulations our approach does not yield an improvement, but on formulations involving a sparse set of matching anchors [11] we can boost the naive solutions of their DAG extensions by exploiting a path cover. Namely, our improvement applies to many cases where a data structure over previously computed solutions is maintained and queried for computing the next value. Our new MPC algorithm enables this, as its complexity is generally of the same form as that of solving the extended problems. Given a path cover, our technique then computes so-called forward propagation links indicating how the partial solutions in each path in the cover must be synchronized.

To best illustrate the versatility of the technique itself, we show (in the Appendix) how to compute a longest increasing subsequence (LIS) in a labeled DAG, in time $O(k|E|\log|V|)$ . This matches the optimal solution to the classical problem on a single sequence when, e.g., this is modeled as a path (where $k=1$ ). We also illustrate our technique with the longest common subsequence (LCS) problem between a labeled DAG $G=(V,E)$ and a sequence $S$ .

Finally, we consider the main problem of this paper—co-linear chaining (CLC)—first introduced in [27]. It has been proposed as a model of the sequence alignment problem that scales to massive inputs, and has been a subject of recent interest (see e.g. [33, 41, 43, 44, 45, 26, 36]). In the CLC problem, the input is directly assumed to be a set of $N$ pairs of intervals in the two sequences that match (either exactly or approximately). The CLC alignment solution asks for a subset of these plausible pairs that maximizes the coverage in one of the sequences, and whose elements appear in increasing order in both sequences. The fastest algorithm for this problem runs in the optimal $O(N\log N)$ time [1].

We define a generalization of the CLC problem between a sequence and a labeled DAG. As motivation, we mention the problem of aligning a long sequence, or even an entire chromosome, inside a DAG storing all known mutations of a population with respect to a reference genome (such as the above-mentioned [9], or more specifically a linearized version of it [17]). Here, the $N$ input pairs match intervals in the sequence with paths (also called anchors) in the DAG. This problem is not straightforward, as the topological order of the DAG might not follow the reachability order between the anchors.

Existing tools for aligning DNA sequences to DAGs (BGREAT [24], vg [29]) rely on anchors but do not explicitly consider solving CLC optimally on the DAG.

The algorithm we propose uses the general framework mentioned above. Since it is more involved, we will develop it in stages. We first give a simple approach to solve a relaxed co-linear chaining problem using $O((|V|+|E|)N)$ time, then introduce the MPC approach that requires $O(k|E|\log|V|+kN\log N)$ time. As above, if the DAG is a labeled path representing a sequence, the running time of our algorithm is reduced to the best current solution for the co-linear chaining problem on sequences, $O(N\log N)$ [1]. We conclude (in the Appendix) with a Burrows-Wheeler technique to efficiently handle a special case that we omitted in this relaxed variant. We remark that one can reduce the LIS and LCS problems to the CLC problem to obtain the same running time bounds as mentioned earlier; these are given for the sake of comprehensiveness.

In the last section we discuss the anchor-finding preprocessing step. We implemented the new MPC-based co-linear chaining algorithm and conducted experiments on splicing graphs to show that the approach is practical, once anchors are given. Some future directions on how to incorporate practical anchors, and how to apply the techniques to transcript prediction, are discussed.

Notation.

To simplify notation, for any DAG $G=(V,E)$ we will assume that $V$ is always $\{1,\dots,|V|\}$ and that $1,\dots,|V|$ is a topological order on $V$ (so that for every edge $(u,v)$ we have $u<v$ ). We will also assume that $|E|\geq|V|-1$ . A labeled DAG is a tuple $(V,E,\ell,\Sigma)$ where $(V,E)$ is a DAG and $\ell:V\mapsto\Sigma$ assigns to the nodes labels from $\Sigma$ , $\Sigma$ being an ordered alphabet.

For a node $v\in V$ , we denote by $N^{-}(v)$ the set of in-neighbors of $v$ and by $N^{+}(v)$ the set of out-neighbors of $v$ . If there is a (possibly empty) path from node $u$ to node $v$ we say that $u$ reaches $v$ . We denote by $R^{-}(v)$ the set of nodes that reach $v$ . We denote a set of consecutive integers with interval notation $[i..j]$ , meaning $\{i,i+1,\ldots,j\}$ . For a pair of intervals $m=([x..y],[c..d])$ , we use $m.x$ , $m.y$ , $m.c$ , and $m.d$ to denote the four respective endpoints. We also consider pairs of the form $m=(P,[c..d])$ where $P$ is a path, and use $m.P$ to access $P$ . The first node of $P$ will be called its startpoint, and its last node will be called its endpoint. For a set $M$ we may fix an order, to access an element as $M[i]$ .

2 The MPC algorithm

In this section we assume basic familiarity with network flow concepts; see [2] for further details. In the minimum flow problem, we are given a directed graph $G=(V,E)$ with a single source and a single sink, with a demand $d:E\rightarrow\mathbb{Z}$ for every edge. The task is to find a flow of minimum value (the value is the sum of the flow on the edges exiting the source) that satisfies all demands (to be called feasible). The standard reduction from the minimum path cover problem to a minimum flow one (see, e.g. [30]) creates a new DAG $G^{\ast}$ by replacing each node $v$ with two nodes $v^{-},v^{+}$ , adds the edge $(v^{-},v^{+})$ and adds all in-neighbors of $v$ as in-neighbors of $v^{-}$ , and all out-neighbors of $v$ as out-neighbors of $v^{+}$ . Finally, the reduction adds a global source with an out-going edge to every node, and a global sink with an in-coming edge from every node. Edges of type $(v^{-},v^{+})$ get demand $1$ , and all other edges get demand $0$ . The value of the minimum flow equals $k$ , the width of $G$ , and any decomposition of it into source-to-sink paths induces a minimum path cover in $G$ .

Our MPC algorithm is based on the following simple reduction of a minimum flow problem to a maximum flow one (see e.g. [2]): (i) find a feasible flow $f:E\rightarrow\mathbb{Z}$ ; (ii) transform this into a minimum feasible flow, by finding a maximum flow $f^{\prime}$ in $G$ in which every $e\in E$ now has capacity $f(e)-d(e)$ . The final minimum flow solution is obtained as $f(e)-f^{\prime}(e)$ , for every $e\in E$ . As explained below, we solve step (i) with a path cover of size $O(k\log|V|)$ , which induces a flow of value $O(k\log|V|)$ . Thus, in step (ii) we need to shrink this flow into a flow of value $k$ . If we run the Ford-Fulkerson algorithm, this means that there are $O(k\log|V|)$ successive augmenting paths, each of which can be found in time $O(|E|)$ . This gives a time bound for step (ii) of $O(k|E|\log|V|)$ .

We solve step (i) in time $O(k|E|\log|V|)$ by finding a path cover in $G^{\ast}$ whose size is larger than $k$ only by a relative factor $O(\log|V|)$ . This is based on the classical greedy set cover algorithm, see e.g. [42, Chapter 2]: at each step, select a path covering most of the remaining uncovered nodes.

This approach is similar to the one from [12] for finding the minimum number $k$ of chains to cover a partial order of size $n$ . A chain is a set of pairwise comparable elements. The algorithm from [12] runs in time $O(kn^{2})$ , and it has the same feature as ours: it first finds a set of $O(k\log n)$ chains in the same way as us (longest chains covering most uncovered elements), and then in a second step reduces these to $k$ . However, if we were to apply this algorithm to DAGs, it would run in time $O(|V||E|+k|V|^{2})$ , which is slower than our algorithm for small $k$ . This is because it uses the classical reduction given by Fulkerson [15] to a bipartite graph, where each edge of the graph encodes a pair of elements in the relation. Since DAGs are not transitive in general, to use this reduction one needs first to compute the transitive closure of the DAG, in time $O(|V||E|)$ . Such an approximation-refinement approach has also been applied to other covering problems on graphs, such as a 2-hop cover [10].

We now show how to solve step (i) within the claimed running time, by dynamic programming.

Lemma 1

Let $G=(V,E)$ be a DAG, and let $k$ be the width of $G$ . In time $O(k|E|\log|V|)$ , we can compute a path cover $P_{1},\dots,P_{K}$ of $G$ , such that $K=O(k\log|V|)$ .

Proof.

The algorithm works by choosing, at each step, a path that covers the most uncovered nodes. For every node $v\in V$ , we store $\mathtt{m}[v]=1$ , if $v$ is not covered by any path, and $\mathtt{m}[v]=0$ otherwise. We also store $\mathtt{u}[v]$ as the largest number of uncovered nodes on a path starting at $v$ . The values $\mathtt{u}[\cdot]$ are computed by dynamic programming, by traversing the nodes in inverse topological order and setting $\mathtt{u}[v]=\mathtt{m}[v]+\max_{w\in N^{+}(v)}\mathtt{u}[w]$ . Initially we have $\mathtt{m}[v]=1$ for all $v$ . We then compute $\mathtt{u}[v]$ for all $v$ , in time $O(|E|)$ . By taking the node $v$ with the maximum $\mathtt{u}[v]$ , and tracing back along the optimal path starting at $v$ , we obtain our first path in time $O(|E|)$ . We then update $\mathtt{m}[v]=0$ for all nodes on this path, and iterate this process until all nodes are covered. This takes overall time $O(K|E|)$ , where $K$ is the number of paths found.

The analysis of this algorithm is identical to that of the classical greedy set cover algorithm [42, Chapter 2]: the universe to be covered is $V$ and each possible path in $G$ is a possible covering set, which implies that $K=O(k\log|V|)$ . ∎
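The greedy procedure from the proof of Lemma 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `greedy_path_cover` and the 0-based node numbering are assumptions, and nodes are assumed to be numbered in topological order, as in the Notation paragraph. The arrays `uncovered` and `best` play the roles of $\mathtt{m}[\cdot]$ and $\mathtt{u}[\cdot]$.

```python
def greedy_path_cover(n, edges):
    """Greedy path cover of a DAG on nodes 0..n-1 (u < v for every edge)."""
    out = [[] for _ in range(n)]
    for u, v in edges:
        out[u].append(v)

    uncovered = [1] * n            # m[v]: 1 while v is not on any chosen path
    remaining = n
    cover = []
    while remaining > 0:
        best = [0] * n             # u[v]: most uncovered nodes on a path from v
        nxt = [-1] * n             # successor achieving best[v], -1 to stop
        for v in reversed(range(n)):          # inverse topological order
            gain, choice = 0, -1
            for w in out[v]:
                if best[w] > gain:
                    gain, choice = best[w], w
            best[v] = uncovered[v] + gain
            nxt[v] = choice

        # Start at a node with maximum u[v] and trace the optimal path.
        v = max(range(n), key=best.__getitem__)
        path = []
        while v != -1:
            path.append(v)
            if uncovered[v]:
                uncovered[v] = 0
                remaining -= 1
            v = nxt[v]
        cover.append(path)
    return cover
```

Each round costs $O(|V|+|E|)$ and covers at least one new node, so the loop runs $K$ times for a cover of $K$ paths. On the six-node DAG with edges `(0,1), (1,3), (3,5), (2,3), (1,4)` the sketch returns three paths although the width is two, which is exactly the slack that step (ii) removes.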

Combining Lemma 1 with the above-mentioned application of the Ford-Fulkerson algorithm, we obtain our first result:

Theorem 2.1

Given a DAG $G=(V,E)$ of width $k$ , the MPC problem on $G$ can be solved in time $O(k|E|\log|V|)$ .
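Step (ii) can be sketched as below, under several assumptions: the function name `shrink_path_cover`, the encoding of $v^{-}$ and $v^{+}$ as `2v` and `2v+1`, and the use of breadth-first search to find augmenting paths are all illustrative choices. The residual search lets flow decrease on an arc only while it stays at or above the demand, and also lets flow increase on an arc; the latter is part of the usual minimum-flow residual graph and is an assumption here, since the text above states only the capacities $f(e)-d(e)$. The input DAG is assumed to have no duplicate edges.

```python
from collections import deque

def shrink_path_cover(n, edges, cover):
    """Shrink a path cover of a DAG on nodes 0..n-1 to a minimum one."""
    s, t = 2 * n, 2 * n + 1            # v is split into v^- = 2v and v^+ = 2v+1
    arcs, arc_id = [], {}              # arcs of G*: (tail, head, demand)

    def add_arc(x, y, demand):
        arc_id[(x, y)] = len(arcs)
        arcs.append((x, y, demand))

    for v in range(n):
        add_arc(2 * v, 2 * v + 1, 1)   # (v^-, v^+) must carry at least one unit
        add_arc(s, 2 * v, 0)
        add_arc(2 * v + 1, t, 0)
    for u, v in edges:
        add_arc(2 * u + 1, 2 * v, 0)

    # Step (i): the given cover induces a feasible flow.
    flow = [0] * len(arcs)
    for path in cover:
        flow[arc_id[(s, 2 * path[0])]] += 1
        flow[arc_id[(2 * path[-1] + 1, t)]] += 1
        for v in path:
            flow[arc_id[(2 * v, 2 * v + 1)]] += 1
        for u, v in zip(path, path[1:]):
            flow[arc_id[(2 * u + 1, 2 * v)]] += 1

    # Residual moves: along an arc (flow - 1, only above the demand)
    # or against it (flow + 1, always allowed).
    moves = [[] for _ in range(2 * n + 2)]
    for a, (x, y, _) in enumerate(arcs):
        moves[x].append((y, a, -1))
        moves[y].append((x, a, +1))

    # Step (ii): every augmenting s-t path lowers the flow value by one.
    while True:
        pred = {s: None}
        queue = deque([s])
        while queue and t not in pred:
            x = queue.popleft()
            for y, a, delta in moves[x]:
                if y in pred or (delta == -1 and flow[a] <= arcs[a][2]):
                    continue
                pred[y] = (x, a, delta)
                queue.append(y)
        if t not in pred:
            break
        y = t
        while pred[y] is not None:
            x, a, delta = pred[y]
            flow[a] += delta
            y = x

    # Decompose the final flow into source-to-sink paths.
    leaving = [[] for _ in range(2 * n + 2)]
    for a, (x, _, _) in enumerate(arcs):
        leaving[x].append(a)
    result = []
    while any(flow[a] > 0 for a in leaving[s]):
        x, path = s, []
        while x != t:
            a = next(a for a in leaving[x] if flow[a] > 0)
            flow[a] -= 1
            x = arcs[a][1]
            if x < 2 * n and x % 2 == 0:   # entered some v^-
                path.append(x // 2)
        result.append(path)
    return result
```

Starting from a cover of $K$ paths, the loop performs $K-k$ augmentations, each found in $O(|V|+|E|)$ time, which matches the bound argued above when $K=O(k\log|V|)$.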

3 The dynamic programming framework

In this section we give an overview of the main ideas of our approach.

Suppose we have a problem involving DAGs that is solvable, for example by dynamic programming, by traversing the nodes in topological order. Assume also that a partial solution at each node $v$ is obtainable from all (and only) nodes of the DAG that can reach $v$ , plus some other independent objects, such as another sequence. Furthermore, suppose that at each node $v$ we need to query (and maintain) a data structure $\mathcal{T}$ that depends on $R^{-}(v)$ and such that the answer $\mathsf{Query}(R^{-}(v))$ at $v$ is decomposable as:

$$\mathsf{Query}(R^{-}(v))=\bigoplus_{i}\mathsf{Query}(R^{-}_{i}(v)).\qquad(1)$$

In the above, the sets $R^{-}_{i}(v)$ are such that $R^{-}(v)=\bigcup_{i}R^{-}_{i}(v)$ , they are not necessarily disjoint, and $\bigoplus$ is some operation on the queries, such as min or max, that does not assume disjointness. It is understood that after the computation at $v$ , we need to update $\mathcal{T}$ . It is also understood that once we have updated $\mathcal{T}$ at $v$ , we cannot query $\mathcal{T}$ for a node before $v$ in topological order, because it would give an incorrect answer.

The first idea is to decompose the graph into a path cover $P_{1},\dots,P_{K}$ . As such, we decompose the computation only along these paths, in light of (1). We replace a single data structure $\mathcal{T}$ with $K$ data structures $\mathcal{T}_{1},\dots,\mathcal{T}_{K}$ , and perform the operation from (1) on the results of the queries to these $K$ data structures.

Our second idea concerns the order in which the nodes on these $K$ paths are processed. Because the answer at $v$ depends on $R^{-}(v)$ , we cannot process the nodes on the $K$ paths (and update the corresponding $\mathcal{T}_{i}$ ’s) in an arbitrary order. As such, for every path $i$ and every node $v$ , we distinguish the last node on path $i$ that reaches $v$ (if it exists). We will call this node $\mathtt{last2reach}[v,i]$ . See Figure 1 for an example. We note that this insight is the same as in [21], which symmetrically identified the first node on a chain $i$ that can be reached from $v$ (a chain is a subsequence of a path). The following observation is the first ingredient for using the decomposition (1).

Observation 1

Let $P_{1},\dots,P_{K}$ be a path cover of a DAG $G$ , and let $v\in V(G)$ . Let $R_{i}$ denote the set of nodes of $P_{i}$ from its beginning until $\mathtt{last2reach}[v,i]$ inclusively (or the empty set, if $\mathtt{last2reach}[v,i]$ does not exist). Then $R^{-}(v)=\bigcup_{i=1}^{K}R_{i}$ .

Proof.

It is clear that $\bigcup_{i=1}^{K}R_{i}\subseteq R^{-}(v)$ . To show the reverse inclusion, consider a node $u\in R^{-}(v)$ . Since $P_{1},\dots,P_{K}$ is a path cover, $u$ appears on some $P_{i}$ . Since $u$ reaches $v$ , either $u$ appears on $P_{i}$ before $\mathtt{last2reach}[v,i]$ , or $u=\mathtt{last2reach}[v,i]$ . Therefore $u$ belongs to $R_{i}$ , as desired. ∎

This allows us to identify, for every node $u$ , a set of forward propagation links $\mathtt{forward}[u]$ , where $(v,i)\in\mathtt{forward}[u]$ holds for any node $v$ and index $i$ with $\mathtt{last2reach}[v,i]=u$ . These propagation links are the second ingredient in the correctness of the decomposition. Once we have computed the correct value at $u$ , we update the corresponding data structures $\mathcal{T}_{i}$ for all paths $i$ to which $u$ belongs. We also propagate the query value of $\mathcal{T}_{i}$ in the decomposition (1) for all nodes $v$ with $(v,i)\in\mathtt{forward}[u]$ . This means that when we come to process $v$ , we have already correctly computed all terms in the decomposition (1) and it suffices to apply the operation $\bigoplus$ to these terms.

The next lemma shows how to compute the values $\mathtt{last2reach}$ (and, as a consequence, all forward propagation links), also by dynamic programming.

Lemma 2

Let $G=(V,E)$ be a DAG, and let $P_{1},\dots,P_{K}$ be a path cover of $G$ . For every $v\in V$ and every $i\in[1..K]$ , we can compute $\mathtt{last2reach}[v,i]$ in overall time $O(K|E|)$ .

Proof.

For each $P_{i}$ and every node $v$ on $P_{i}$ , let $\mathtt{index}[v,i]$ denote the position of $v$ in $P_{i}$ (say, starting from $1$ ). Our algorithm actually computes $\mathtt{last2reach}[v,i]$ as the index of this node in $P_{i}$ . Initially, we set $\mathtt{last2reach}[v,i]=-1$ for all $v$ and $i$ . At the end of the algorithm, $\mathtt{last2reach}[v,i]=-1$ will hold precisely for those nodes $v$ that cannot be reached by any node of $P_{i}$ . We traverse the nodes in topological order. For every $i\in[1..K]$ , we do as follows: if $v$ is on $P_{i}$ , then we set $\mathtt{last2reach}[v,i]=\mathtt{index}[v,i]$ . Otherwise, we compute by dynamic programming $\mathtt{last2reach}[v,i]$ as $\max_{u\in N^{-}(v)}\mathtt{last2reach}[u,i]$ . ∎
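The dynamic program of Lemma 2, together with the forward propagation links derived from it, can be sketched as follows. The function names and the 0-based node and path numbering are assumptions; positions inside a path are 1-based and $-1$ marks "unreachable", as in the proof. The helper `reaches` is one way to answer reachability queries from these tables and follows from Observation 1; it is not taken from the appendix.

```python
def compute_last2reach(n, edges, paths):
    """last2reach and forward links for a DAG on nodes 0..n-1 (topological order)."""
    K = len(paths)
    in_nbrs = [[] for _ in range(n)]
    for u, v in edges:
        in_nbrs[v].append(u)

    # index[i][v]: 1-based position of v on path i (only for nodes on the path).
    index = [{v: pos for pos, v in enumerate(p, start=1)} for p in paths]

    last2reach = [[-1] * K for _ in range(n)]
    for v in range(n):                       # topological order
        for i in range(K):
            if v in index[i]:
                last2reach[v][i] = index[i][v]
            else:
                last2reach[v][i] = max(
                    (last2reach[u][i] for u in in_nbrs[v]), default=-1)

    # forward[u] lists the pairs (v, i) with last2reach[v, i] = u.
    forward = [[] for _ in range(n)]
    for v in range(n):
        for i in range(K):
            if last2reach[v][i] != -1:
                forward[paths[i][last2reach[v][i] - 1]].append((v, i))
    return index, last2reach, forward


def reaches(u, v, index, last2reach):
    """u reaches v iff u lies on some path i at or before last2reach[v][i]."""
    return any(u in idx and idx[u] <= last2reach[v][i]
               for i, idx in enumerate(index))
```

The tables take $O(K|V|)$ space. As written, `reaches` inspects every path; storing for each node one path that contains it would reduce a query to a single comparison.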

An immediate application of Theorem 2.1 and of the values $\mathtt{last2reach}[v,i]$ is for solving reachability queries (see Appendix 0.A.1). Another simple application is an extension of the longest increasing subsequence (LIS) problem to labeled DAGs (Appendix 0.A.2).

The LIS problem, the LCS problem of Section 4, as well as the CLC problem of Section 5 make use of the following standard data structure (see e.g. [25, p.20]).

Lemma 3

The following two operations can be supported with a balanced binary search tree $\mathcal{T}$ in time $O(\log n)$ , where $n$ is the number of leaves in the tree.

•

$\mathsf{update}(k,\mathtt{val})$ : For the leaf $w$ with $\mathtt{key}(w)=k$ , update $\mathtt{value}(w)=\mathtt{val}$ .

•

$\mathsf{RMaxQ}(l,r)$ : Return $\max_{w\>:\>l\leq\mathtt{key}(w)\leq r}\mathtt{value}(w)$ (Range Maximum Query).

Moreover, the balanced binary search tree can be built in $O(n)$ time, given the $n$ pairs $(\mathtt{key},\mathtt{value})$ sorted by component $\mathtt{key}$ .
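The interface of Lemma 3 can be illustrated with an array-based segment tree over the integer keys $0,\dots,n-1$. This is a swapped-in structure rather than the balanced binary search tree of the lemma: it supports the same two operations in $O(\log n)$ time and $O(n)$ construction, but only for a fixed key range. The class name `RangeMaxTree` and the method names are assumptions.

```python
NEG_INF = float("-inf")

class RangeMaxTree:
    """Point update and range-maximum query over keys 0..n-1."""

    def __init__(self, values):
        self.n = len(values)
        # Leaves live at positions n..2n-1; position 0 is unused.
        self.tree = [NEG_INF] * self.n + list(values)
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = max(self.tree[2 * i], self.tree[2 * i + 1])

    def update(self, key, val):
        """Set the value stored at `key` to `val`."""
        i = key + self.n
        self.tree[i] = val
        i //= 2
        while i >= 1:
            self.tree[i] = max(self.tree[2 * i], self.tree[2 * i + 1])
            i //= 2

    def rmaxq(self, l, r):
        """Return the maximum value over keys l..r (inclusive)."""
        best = NEG_INF
        lo, hi = l + self.n, r + self.n + 1   # half-open leaf range [lo, hi)
        while lo < hi:
            if lo & 1:
                best = max(best, self.tree[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                best = max(best, self.tree[hi])
            lo //= 2
            hi //= 2
        return best
```

Initializing with `[0, NEG_INF, NEG_INF, ...]` mirrors the key-value pairs $(0,0),(1,-\infty),\dots$ used for the search trees in the LCS algorithm.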

4 The LCS problem

Consider a labeled DAG $G=(V,E,\ell,\Sigma)$ and a sequence $S\in\Sigma^{*}$ , where $\Sigma$ is an ordered alphabet. We say that the longest common subsequence (LCS) between $G$ and $S$ is a longest subsequence $C$ of any path label in $G$ such that $C$ is also a subsequence of $S$ .

We will minimally modify the LIS algorithm of Appendix 0.A.2 to find an LCS between a DAG $G$ and a sequence $S$ . The description is self-contained but, in the interest of space, denser than the derivation of the LIS algorithm. The purpose is to give an example of the general MPC framework with fewer technical details than required for the main result of this paper concerning co-linear chaining.

For any $c\in\Sigma$ , let $S(c)$ denote the set $\{j\mid S[j]=c\}$ . For each node $v$ and each $j\in S(\ell(v))$ , we aim to store in $\mathtt{LLCS}[v,j]$ the length of the longest common subsequence between $S[1..j]$ and the label of any path ending at $v$ , among all subsequences having $\ell(v)=S[j]$ as the last symbol.

Assume we have a path cover of size $K$ and $\mathtt{forward}[u]$ computed for all $u\in V$ . Assume also we have mapped $\Sigma$ to $\{0,1,2,\ldots,|S|+1\}$ in $O((|V|+|S|)\log|S|)$ time (e.g. by sorting the symbols of $S$ , binary searching labels of $V$ , and then relabeling by ranks, with the exception that, if a node label does not appear in $S$ , it is replaced by $|S|+1$ ).

Let $\mathcal{T}_{i}$ be a search tree of Lemma 3 initialized with key-value pairs $(0,0)$ , $(1,-\infty)$ , $(2,-\infty)$ , …, $(|S|,-\infty)$ , for each $i\in[1..K]$ . The algorithm proceeds in fixed topological ordering on $G$ . At a node $u$ , for every $(v,i)\in\mathtt{forward}[u]$ we now update an array $\mathtt{LLCS}[v,j]$ for all $j\in S(\ell(v))$ as follows: $\mathtt{LLCS}[v,j]=\max(\mathtt{LLCS}[v,j],\mathcal{T}_{i}.\mathsf{RMaxQ}(0,j-1)+1)$ . The update step of $\mathcal{T}_{i}$ when the algorithm reaches a node $v$ , for each covering path $i$ containing $v$ , is done as $\mathcal{T}_{i}.\mathsf{update}(j^{\prime},\mathtt{LLCS}[v,j^{\prime}])$ for all $j^{\prime}$ with $j^{\prime}<j$ and $j^{\prime}\in S(\ell(v))$ . Initialization is handled by the $(0,0)$ key-value pair so that any $(v,j)$ with $\ell(v)=S[j]$ can start a new common subsequence.

The final answer to the problem is $\max_{v\in V,j\in S(\ell(v))}\mathtt{LLCS}[v,j]$ , with the actual LCS to be found with a standard traceback. The algorithm runs in $O((|V|+|S|)\log|S|+K|M|\log|S|)$ time, where $M=\{(v,j)\mid v\in V,j\in[1..|S|],\ell(v)=S[j]\}$ , and assuming a cover of $K$ paths is given. Notice that $|M|$ can be $\Omega(|V||S|)$ . With Theorem 2.1 plugged in, the total running time becomes $O(k|E|\log|V|+(|V|+|S|)\log|S|+k|M|\log|S|)$ . Since the queries on the data structures are semi-open, one can use the more efficient data structure from [16] to improve the bound to $O(k|E|\log|V|+(|V|+|S|)\log|S|+k|M|\log\log|S|)$ . The following theorem summarizes this result.

Theorem 4.1

Let $G=(V,E,\ell,\Sigma)$ be a labeled DAG of width $k$ , and let $S\in\Sigma^{\ast}$ , where $\Sigma$ is an ordered alphabet. We can find a longest common subsequence between $G$ and $S$ in time $O(k|E|\log|V|+(|V|+|S|)\log|S|+k|M|\log\log|S|)$ .

When $G$ is a path, the bound improves to $O((|V|+|S|)\log|S|+|M|\log\log|S|)$ , which nearly matches the fastest sparse dynamic programming algorithm for the LCS on two sequences [11] (with a difference in $\log\log$ -factor due to a different data structure, which does not work for this order of computation).
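The LCS algorithm of this section can be sketched end to end as below, under several labeled assumptions. The function name `dag_lcs_length` is hypothetical; nodes are numbered $0,\dots,n-1$ in topological order; the path cover is passed in rather than computed; consecutive nodes of each cover path are joined by edges. Two simplifications are swapped in: a Fenwick tree with prefix-maximum queries replaces the search tree of Lemma 3 (the queries are of the form $\mathsf{RMaxQ}(0,j-1)$ and stored values only grow), and the forward links are built from the last node of each path that reaches $v$ by a non-empty path, so that a node never queries its own entry.

```python
def dag_lcs_length(labels, edges, paths, S):
    """Length of an LCS between a labeled DAG and the sequence S."""
    n, K, m = len(labels), len(paths), len(S)
    NEG = float("-inf")

    occ = {}                                   # S(c): 1-based positions of c in S
    for j, c in enumerate(S, start=1):
        occ.setdefault(c, []).append(j)
    in_nbrs = [[] for _ in range(n)]
    for u, v in edges:
        in_nbrs[v].append(u)
    pos = [{v: t for t, v in enumerate(p)} for p in paths]

    # last[v][i]: 0-based position of the last node of path i reaching v
    # (v itself if it lies on path i); forward links use strict predecessors.
    last = [[-1] * K for _ in range(n)]
    forward = [[] for _ in range(n)]
    for v in range(n):
        for i in range(K):
            strict = max((last[u][i] for u in in_nbrs[v]), default=-1)
            if strict >= 0:
                forward[paths[i][strict]].append((v, i))
            last[v][i] = pos[i].get(v, strict)

    # One Fenwick tree per path over keys 0..m, storing prefix maxima.
    trees = [[NEG] * (m + 2) for _ in range(K)]

    def update(tree, key, val):
        i = key + 1
        while i <= m + 1:
            tree[i] = max(tree[i], val)
            i += i & -i

    def prefix_max(tree, key):
        i, best = key + 1, NEG
        while i > 0:
            best = max(best, tree[i])
            i -= i & -i
        return best

    for tree in trees:
        update(tree, 0, 0)                     # the (0, 0) key-value pair

    llcs = [{j: NEG for j in occ.get(labels[v], [])} for v in range(n)]
    answer = 0
    for u in range(n):                         # topological order
        for j in llcs[u]:
            llcs[u][j] = max(llcs[u][j], 1)    # u alone starts a subsequence
            answer = max(answer, llcs[u][j])
        for i in range(K):
            if u in pos[i]:
                for j, val in llcs[u].items():
                    update(trees[i], j, val)
        for v, i in forward[u]:                # propagate to nodes that need T_i now
            for j in llcs[v]:
                llcs[v][j] = max(llcs[v][j], prefix_max(trees[i], j - 1) + 1)
    return answer
```

For the six-node DAG with edges `(0,1), (1,3), (3,5), (2,3), (1,4)`, labels `"ACGTAT"` and the cover `[[0,1,4],[2,3,5]]`, the path `0,1,3,5` spells `ACTT`, so the call with `S = "ACTT"` returns 4.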

5 Co-linear chaining

We start with a formal definition of the co-linear chaining problem (see Figure 2 for an illustration), following the notions introduced in [25, Section 15.4].

Problem 1 (Co-linear chaining (CLC))

Let $T$ and $R$ be two sequences over an alphabet $\Sigma$ , and let $M$ be a set of $N$ pairs $([x..y],[c..d])$ . Find an ordered subset $S=s_{1}s_{2}\cdots s_{p}$ of pairs from $M$ such that

•

$s_{j-1}.y<s_{j}.y$ and $s_{j-1}.d<s_{j}.d$ , for all $2\leq j\leq p$ , and

•

$S$ maximizes the ordered coverage of $R$ , defined as

$$\mathtt{coverage}(R,S)=|\{i\in[1..|R|]\>|\>i\in[s_{j}.c..s_{j}.d]\text{ for some }1\leq j\leq p\}|.$$

The definition of ordered coverage between two sequences is symmetric, as we can simply exchange the roles of $T$ and $R$ . But when solving the CLC problem between a DAG and a sequence, we must choose whether we want to maximize the ordered coverage on the sequence $R$ or on the DAG $G$ . We will consider the former variant.

First, we define the following precedence relation:

Definition 1

Given two paths $P_{1}$ and $P_{2}$ in a DAG $G$ , we say that $P_{1}$ precedes $P_{2}$ , and write $P_{1}\prec P_{2}$ , if one of the following conditions holds:

•

$P_{1}$ and $P_{2}$ do not share nodes and there is a path in $G$ from the endpoint of $P_{1}$ to the startpoint of $P_{2}$ , or

•

$P_{1}$ and $P_{2}$ have a suffix-prefix overlap and $P_{2}$ is not fully contained in $P_{1}$ ; that is, if $P_{1}=(a_{1},\dots,a_{i})$ and $P_{2}=(b_{1},\dots,b_{j})$ then there exists a $k\in\{\max(1,2+i-j),\dots,i\}$ such that $a_{k}=b_{1}$ , $a_{k+1}=b_{2}$ , …, $a_{i}=b_{1+i-k}$ .
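The two conditions of Definition 1 translate into a short predicate. This is a sketch: the function name `precedes` and the `reaches(u, v)` callback (any reachability oracle for $G$) are assumptions, and paths are given as lists of nodes.

```python
def precedes(p1, p2, reaches):
    """Return True if path p1 precedes path p2 in the sense of Definition 1."""
    i, j = len(p1), len(p2)
    if not set(p1) & set(p2):
        # Disjoint paths: the endpoint of p1 must reach the startpoint of p2.
        return reaches(p1[-1], p2[0])
    # Suffix-prefix overlap starting at (1-based) position k of p1; the lower
    # bound on k keeps the overlap shorter than p2, so p2 is not contained in p1.
    for k in range(max(1, 2 + i - j), i + 1):
        if p1[k - 1:] == p2[:i - k + 1]:
            return True
    return False
```

For instance, `[1, 2, 3]` precedes `[2, 3, 4]` through the overlap `2, 3`, whereas `[1, 2, 3]` does not precede `[2, 3]`, which is fully contained in it.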

We then extend the formulation of Problem 1 to handle a sequence and a DAG.

Problem 2 (CLC between a sequence and a DAG)

Let $R$ be a sequence, let $G$ be a labeled DAG, and let $M$ be a set of $N$ pairs $(P,[c..d])$ , where $P$ is a path in $G$ and $c\leq d$ are non-negative integers. Find an ordered subset $S=s_{1}s_{2}\cdots s_{p}$ of pairs from $M$ such that

•

for all $2\leq j\leq p$ , it holds that $s_{j-1}.P\prec s_{j}.P$ and $s_{j-1}.d<s_{j}.d$ , and

•

$S$ maximizes the ordered coverage of $R$ , analogously defined as $\mathtt{coverage}(R,S)=|\{i\in[1..|R|]\>|\>i\in[s_{j}.c..s_{j}.d]\text{ for some }1\leq j\leq p\}|$ .

To illustrate the main technique of this paper, let us for now only seek solutions where paths in consecutive pairs in a solution do not overlap in the DAG. Suffix-prefix overlaps between paths turn out to be challenging; we will postpone this case until Appendix 0.B.

Problem 3 (Overlap-limited CLC between a sequence and a DAG)

Let $R$ be a sequence, let $G$ be a labeled DAG, and let $M$ be a set of $N$ pairs $(P,[c..d])$ , where $P$ is a path in $G$ and $c\leq d$ are non-negative integers (with the interpretation that $\ell(P)$ matches $R[c..d]$ ). Find an ordered subset $S=s_{1}s_{2}\cdots s_{p}$ of pairs from $M$ such that

•

for all $2\leq j\leq p$ , it holds that there is a non-empty path from the last node of $s_{j-1}.P$ to the first node of $s_{j}.P$ and $s_{j-1}.d<s_{j}.d$ , and

•

$S$ maximizes $\mathtt{coverage}(R,S)$ .

First, let us consider a trivial approach to solve Problem 3. Assume we have ordered in $O(|E|+N)$ time the $N$ input pairs as $M[1],M[2],\dots,M[N]$ , so that the endpoints of $M[1].P,M[2].P,\dots,M[N].P$ are in topological order, breaking ties arbitrarily. We denote by $C[j]$ the maximum ordered coverage of $R[1..M[j].d]$ using the pair $M[j]$ and any subset of pairs from $\{M[1],M[2],\dots,M[j-1]\}$ .

Theorem 5.1

Overlap-limited co-linear chaining between a sequence and a labeled DAG $G=(V,E,\ell,\Sigma)$ (Problem 3) on $N$ input pairs can be solved in $O((|V|+|E|)N)$ time.

Proof.

First, we reverse the edges of $G$ . Then we mark the nodes that correspond to the path endpoints for every pair. After this preprocessing we can start computing the maximum ordered coverage for the pairs as follows: for every pair $M[j]$ in topological order of their path endpoints for $j\in\{1,\dots,N\}$ we do a depth-first traversal starting at the startpoint of path $M[j].P$ . Note that since the edges are reversed, the depth-first traversal checks only pairs whose paths are predecessors of $M[j].P$ .

Whenever we encounter a node that corresponds to the path endpoint of a pair $M[j^{\prime}]$ , we first examine whether it fulfills the criterion $M[j^{\prime}].d<M[j].c$ (call this case (a)). The best ordered coverage using pair $M[j]$ after all such $M[j^{\prime}]$ is then

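$$C^{\textrm{a}}[j]=\max_{j^{\prime}\,:\,M[j^{\prime}].d<M[j].c}\{C[j^{\prime}]+(M[j].d-M[j].c+1)\},\qquad(2)$$

where $C[j^{\prime}]$ is the best ordered coverage when using pair $M[j^{\prime}]$ last.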

If pair $M[j^{\prime}]$ does not fulfill the criterion for case (a), we then check whether $M[j].c\leq M[j^{\prime}].d\leq M[j].d$ (call this case (b)). The best ordered coverage using pair $M[j]$ after all such $M[j^{\prime}]$ with $M[j^{\prime}].c<M[j].c$ is then

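$$C^{\textrm{b}}[j]=\max_{j^{\prime}\,:\,M[j].c\leq M[j^{\prime}].d\leq M[j].d}\{C[j^{\prime}]+(M[j].d-M[j^{\prime}].d)\}.\qquad(3)$$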

Inclusions, i.e. $M[j].c\leq M[j^{\prime}].c$ , can be left computed incorrectly in $C^{\textrm{b}}[j]$ , since there is a better or equally good solution computed in $C^{\textrm{a}}[j]$ or $C^{\textrm{b}}[j]$ that does not use them [1].

Finally, we take $C[j]=\max(C^{\textrm{a}}[j],C^{\textrm{b}}[j])$ . Depth-first traversal takes $O(|V|+|E|)$ time and is executed $N$ times, for $O((|V|+|E|)N)$ total time. ∎
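The proof can be turned into a short program. The following Python sketch is one possible reading of it, under several assumptions that are not in the paper: nodes are numbered `0..n-1`, each input pair is a tuple `(path, c, d)` with `path` a list of nodes, the function name `trivial_chain_dag` is hypothetical, and a pair used alone contributes coverage `d - c + 1`. The traversal starts from the in-neighbours of the path's first node so that only non-empty connecting paths are considered.

```python
from collections import defaultdict, deque

def trivial_chain_dag(n, edges, pairs):
    """Maximum ordered coverage for Problem 3 by one reverse traversal per pair."""
    succ, pred = defaultdict(list), defaultdict(list)
    indeg = [0] * n
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)  # reversed edge
        indeg[v] += 1
    # Kahn's algorithm gives every node a topological rank.
    rank, queue = {}, deque(v for v in range(n) if indeg[v] == 0)
    while queue:
        u = queue.popleft()
        rank[u] = len(rank)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    # Process pairs in topological order of their path endpoints.
    order = sorted(range(len(pairs)), key=lambda j: rank[pairs[j][0][-1]])
    ends_at = defaultdict(list)  # node -> pairs whose path ends there
    for j in order:
        ends_at[pairs[j][0][-1]].append(j)
    C = {}
    for j in order:
        path, c, d = pairs[j]
        best = d - c + 1  # chain consisting of M[j] alone
        # Depth-first traversal on reversed edges from the strict predecessors.
        stack, seen = list(pred[path[0]]), set(pred[path[0]])
        while stack:
            u = stack.pop()
            for jp in ends_at[u]:
                _, cp, dp = pairs[jp]
                if dp < c:                      # case (a): no overlap on R
                    best = max(best, C[jp] + d - c + 1)
                elif c <= dp <= d and cp < c:   # case (b): overlap, no inclusion
                    best = max(best, C[jp] + d - dp)
            for w in pred[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        C[j] = best
    return max(C.values(), default=0)
```

Each pair triggers one traversal of the reversed graph, which is where the $O((|V|+|E|)N)$ bound comes from.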

However, we can do significantly better than $O((|V|+|E|)N)$ time. In the next sections we will describe how to apply the framework from Section 3 here.

5.1 Co-linear chaining on sequences revisited

We now describe the dynamic programming algorithm from [1] for the case of two sequences, as we will then reuse this same algorithm in our MPC approach.

First, sort input pairs in $M$ by the coordinate $y$ into the sequence $M[1]$ , $M[2]$ , …, $M[N]$ , so that $M[i].y\leq M[j].y$ holds for all $i<j$ . This will ensure that we consider the overlapping ranges in sequence $T$ in the correct order. Then, we fill a table $C[1..N]$ analogous to that of Theorem 5.1 so that $C[j]$ gives the maximum ordered coverage of $R[1..M[j].d]$ using the pair $M[j]$ and any subset of pairs from $\{M[1],M[2],\dots,M[j-1]\}$ . Hence, $\max_{j}C[j]$ gives the total maximum ordered coverage of $R$ .

Consider Equations (2) and (3). Now we can use an invariant technique to convert these recurrence relations so that we can exploit the range maximum queries of Lemma 3:

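$$C^{\textrm{a}}[j]=(M[j].d-M[j].c+1)+\max_{j^{\prime}\,:\,M[j^{\prime}].d<M[j].c}C[j^{\prime}]=(M[j].d-M[j].c+1)+\mathcal{T}.\mathsf{RMaxQ}(0,M[j].c-1),$$

$$C^{\textrm{b}}[j]=M[j].d+\max_{j^{\prime}\,:\,M[j].c\leq M[j^{\prime}].d\leq M[j].d}\{C[j^{\prime}]-M[j^{\prime}].d\}=M[j].d+\mathcal{I}.\mathsf{RMaxQ}(M[j].c,M[j].d),$$

$$C[j]=\max(C^{\textrm{a}}[j],C^{\textrm{b}}[j]).$$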

For these to work correctly, we need to have properly updated the trees $\mathcal{T}$ and $\mathcal{I}$ for all $j^{\prime}\in[1..j-1]$ . That is, we need to call $\mathcal{T}.\mathsf{update}(M[j^{\prime}].d,C[j^{\prime}])$ and $\mathcal{I}.\mathsf{update}(M[j^{\prime}].d,C[j^{\prime}]-M[j^{\prime}].d)$ after computing each $C[j^{\prime}]$ . The running time is $O(N\log N)$ .
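The update and query pattern just described can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation of [1]: an array-backed segment tree (`MaxTree`) stands in for the balanced search trees of Lemma 3, each input pair is reduced to a tuple `(y, c, d)`, a pair used alone contributes `d - c + 1`, and pairs with equal `y` are processed as a batch so that they never chain with each other. All names are hypothetical.

```python
class MaxTree:
    """Segment tree over keys 0..size-1: point update and range maximum query."""

    def __init__(self, size):
        self.n = size
        self.t = [float("-inf")] * (2 * size)

    def update(self, key, value):
        i = key + self.n
        self.t[i] = max(self.t[i], value)
        i //= 2
        while i >= 1:
            self.t[i] = max(self.t[2 * i], self.t[2 * i + 1])
            i //= 2

    def rmaxq(self, lo, hi):
        """Maximum value over keys lo..hi (inclusive); -inf if the range is empty."""
        res = float("-inf")
        lo += self.n
        hi += self.n + 1
        while lo < hi:
            if lo & 1:
                res = max(res, self.t[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                res = max(res, self.t[hi])
            lo //= 2
            hi //= 2
        return res


def chain_sequences(pairs):
    """Maximum ordered coverage for pairs given as (y, c, d) tuples."""
    if not pairs:
        return 0
    size = max(d for _, _, d in pairs) + 1
    T, I = MaxTree(size), MaxTree(size)
    pairs = sorted(pairs)  # sort by coordinate y
    best, pos = 0, 0
    while pos < len(pairs):
        end = pos
        while end < len(pairs) and pairs[end][0] == pairs[pos][0]:
            end += 1
        batch = []
        for _, c, d in pairs[pos:end]:
            ca = (d - c + 1) + max(0, T.rmaxq(0, c - 1))  # no overlap on R
            cb = d + I.rmaxq(c, d)                        # overlap on R
            batch.append((d, max(ca, cb)))
        for d, cj in batch:  # update only after the whole batch is computed
            T.update(d, cj)
            I.update(d, cj - d)
            best = max(best, cj)
        pos = end
    return best
```

The query on `I` does not filter out inclusions; as noted in the proof of Theorem 5.1, these can only yield values that are matched or beaten by another candidate.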

Figure 2 illustrates the optimal chain on our schematic example. This chain can be extracted by modifying the algorithm to store traceback pointers.

Theorem 5.2 ([36, 1])

Problem 1 on $N$ input pairs can be solved in the optimal $O(N\log N)$ time.

5.2 Co-linear chaining on DAGs using a minimum path cover

Let us now modify the above algorithm to work with DAGs, using the main technique of this paper.

Theorem 5.3

Problem 3 on a labeled DAG $G=(V,E,\ell,\Sigma)$ of width $k$ and a set of $N$ input pairs can be solved in $O(k|E|\log|V|+kN\log N)$ time.

Proof.

Assume we have a path cover of size $K$ and $\mathtt{forward}[u]$ computed for all $u\in V$ . For each path $i\in[1..K]$ , we create two binary search trees $\mathcal{T}_{i}$ and $\mathcal{I}_{i}$ . As a reminder, these trees correspond to coverages for pairs that do not, and do overlap, respectively, on the sequence. Moreover, recall that in Problem 3 we do not consider solutions where consecutive paths in the graph overlap.

As keys, we use $M[j].d$ , for every pair $M[j]$ , and additionally the key 0. The value of every key is initialized to $-\infty$ .

After these preprocessing steps, we process the nodes in topological order, as detailed in Algorithm 1. If node $v$ corresponds to the endpoint of some $M[j].P$ , we update the trees $\mathcal{T}_{i}$ and $\mathcal{I}_{i}$ for all covering paths $i$ containing node $v$ . Then we follow all forward propagation links $(w,i)\in\mathtt{forward}[v]$ and update $C[j]$ for each path $M[j].P$ starting at $w$ , taking into account all pairs whose path endpoints are in covering path $i$ . Before the main loop visits $w$ , we have processed all forward propagation links to $w$ , and the computation of $C[j]$ has taken all previous pairs into account, as in the naive algorithm, but now indirectly through the $K$ search trees. Exceptions are the pairs overlapping in the graph, which we omit in this problem statement. The forward propagation ensures that the search tree query results are indeed taking only reachable pairs into account. While $C[j]$ is already computed when visiting $w$ , the startpoint of $M[j].P$ , the added coverage with the pair is updated to the search trees only when visiting the endpoint.

There are $NK$ forward propagation links, and both search trees are queried in $O(\log N)$ time. All the search trees containing a path endpoint of a pair are updated. Each endpoint can be contained in at most $K$ paths, so this also gives the same bound $2NK$ on the number of updates. With Theorem 2.1 plugged in, we have $K=k$ and the total running time becomes $O(k|E|\log|V|+kN\log N)$ . ∎

Appendix 0.B shows how to handle the case of path overlaps, giving the following result:

Theorem 5.4

Let $G=(V,E,\ell,\Sigma)$ be a labeled DAG and let $M$ be a set of $N$ pairs of the form $(P,[c..d])$ . The algorithms from Theorems 5.1 and 5.3 can be modified to solve Problem 2 with additional time $O(L\log^{2}|V|)$ or $O(L+\mathtt{\#overlaps})$ , where $L$ is at most the input length and $\mathtt{\#overlaps}$ is the number of overlaps between the input paths.

The bound $O(L+\mathtt{\#overlaps})$ comes as a direct consequence of using a generalized suffix tree to compute the overlaps in advance [34, proof of Theorem 2]. With the overlaps given, one can process each in constant time to see if they give the maximum for $C[j]$ . The other bound $O(L\log^{2}|V|)$ comes from backward searching a subpath in a concatenation of all subpaths using FM-index. At each step a range can be identified that contains all subpaths with suffix-prefix overlap with the current one. This range limits the keys suitable in the binary search trees storing the already computed coverage values, and thus a two-dimensional range search is needed (see Appendix 0.B).

6 Discussion and experiments

For applying our solutions to Problem 2 in practice, one first needs to find the alignment anchors. As explained in the problem formulation, alignment anchors are such pairs $(P,[c..d])$ where $P$ is a path in $G$ and $\ell(P)$ matches $R[c..d]$ . With sequence inputs, such pairs are usually taken to be maximal exact matches (MEMs) and can be retrieved in small space in linear time [5, 4]. It is largely an open problem how to retrieve MEMs between a sequence and a DAG efficiently: The case of length-limited MEMs is studied in [37], based on an extension of [38] with features such as suffix tree functionality. On the practical side, anchor finding has already been incorporated into tools for conducting alignment of a sequence to a DAG [24, 29].

For the purpose of demonstrating the efficiency of our MPC-approach applied to co-linear chaining, we implemented a MEM-finding routine based on simple dynamic programming. We leave it for future work to incorporate a practical procedure (e.g. like those in [24, 29]). We tested the time improvement of our MPC-approach (Theorem 5.3) over the trivial algorithm (Theorem 5.1) on the sequence graphs of annotated human genes. Out of all the 62219 genes in the HG38 annotation for all human chromosomes, we singled out 8628 genes such that their sequence graph had at least 5000 nodes. Out of these, we picked 500 genes at random.

The size of the graphs for these 500 genes varied between $|V|=5023$ and $|V|=30959$ vertices. Their width, i.e., the number of paths in the MPC, varied between $k=1$ and $k=15$ . (The number of graphs for each value of $k$ is listed in the column #graphs of the top table of Figure 3.) The number of anchors, $N$ , for patterns of length 1000 varied between $10^{1}$ and $10^{5}$ . As shown in Figure 3, with small values of $N$ , our MPC-based co-linear chaining algorithm was twice as fast as the trivial algorithm. When values of $N$ were increased from $10^{1}$ to $10^{5}$ , the difference increased to two orders of magnitude.

The improved efficiency when compared to the naive approach gives reason to believe a practical sequence-to-DAG aligner can be engineered along the algorithmic foundations given here. Future work includes the incorporation of a practical anchor-finding method, and testing whether the complete scheme improves transcript prediction through improved finding of exon chains [23].

On the theoretical side, it remains open whether the MPC algorithm could benefit from a better initial approximation and/or one that is faster to compute. More generally, it remains open whether the overall bound $O(k|E|\log|V|)$ for the MPC problem can be improved.

Acknowledgements.

We thank Djamal Belazzougui for pointers on backward step on large alphabet and Gonzalo Navarro for pointing out the connection to pattern matching on hypertexts. This work was funded in part by the Academy of Finland (grant 274977 to AIT and grants 284598 and 309048 to AK and to VM), and by Futurice Oy (to TP).

Appendix 0.A Two simple applications

0.A.1 Reachability queries

An immediate application of Theorem 2.1 and of the values $\mathtt{last2reach}[v,i]$ is for solving reachability queries. If we have all these $O(k|V|)$ values, then we can answer in constant time whether a node $y$ is reachable from a node $x$ , as in [21]: we check $\mathtt{index}[x,i]\leq\mathtt{index}[\mathtt{last2reach}[y,i],i]$ , where $\mathtt{index}$ was defined in the proof of Lemma 2, $i$ is a path containing $x$ , and we take by convention $\mathtt{index}[-1,i]=-1$ . Recall also that reachability queries in an arbitrary graph can be reduced to solving reachability queries in its DAG of strongly connected components, because nodes in the same component are pairwise reachable. See Table 1 for existing tradeoffs for solving reachability queries. (Footnote 1: Note that [7] incorrectly attributes to [21] query time $O(\log k)$ , and as a consequence [39, 22] incorrectly mention query time $O(\log k)$ for [7].)

Corollary 1

Let $G=(V,E)$ be an arbitrary directed graph and let the width of its DAG of strongly connected components be $k$ . In time $O(k|E|\log|V|)$ we can construct from $G$ an index of size $O(k|V|)$ , so that for any $x,y\in V$ we can answer in $O(1)$ time whether $y$ is reachable from $x$ .
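A minimal Python sketch of such an index on a DAG is given below. It rests on assumptions that go beyond the text: the path cover is supplied by the caller rather than computed via Theorem 2.1, nodes are numbered `0..n-1`, the table `last` stores the position on path $i$ of the last node reaching $v$ (that is, $\mathtt{index}[\mathtt{last2reach}[v,i],i]$ directly), and `home[v]` records one path containing $v$. The function names are hypothetical.

```python
from collections import defaultdict, deque

def build_reachability_index(n, edges, paths):
    """Return (index, last, home) for a DAG and a path cover of its nodes."""
    succ = defaultdict(list)
    indeg = [0] * n
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    k = len(paths)
    index = [[-1] * k for _ in range(n)]  # index[v][i]: position of v on path i
    home = [-1] * n                       # home[v]: some path containing v
    for i, path in enumerate(paths):
        for p, v in enumerate(path):
            index[v][i] = p
            home[v] = i
    # last[v][i]: largest position on path i of a node that reaches v
    # through at least one edge, or -1 if there is none.
    last = [[-1] * k for _ in range(n)]
    queue = deque(v for v in range(n) if indeg[v] == 0)
    while queue:  # Kahn's algorithm: last[u] is final when u is popped
        u = queue.popleft()
        for v in succ[u]:
            for i in range(k):
                last[v][i] = max(last[v][i], last[u][i], index[u][i])
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return index, last, home

def reachable(index, last, home, x, y):
    """Constant-time test whether y is reachable from x."""
    if x == y:
        return True
    i = home[x]
    return index[x][i] <= last[y][i]
```

Building the tables in this sketch costs $O(k|E|)$ time once a path cover is available, and each query inspects only two table entries.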

0.A.2 The LIS problem

The LIS problem asks us to delete the minimum number of values from an input sequence $s_{1}\cdots s_{n}$ so that the remaining values form a strictly increasing sequence. Here the input sequence is assumed to come from an ordered alphabet $\Sigma$ . For example, on input sequence $1,4,2,3,7,5,6$ , from the alphabet $\Sigma=\{1,2,3,4,5,6,7\}$ , the unique optimal solution is $1,2,3,5,6$ . Such a longest increasing subsequence can be found in the optimal $O(n\log n)$ time [14].

This optimal algorithm works as follows. We first map $\Sigma$ to a subset of $\{1,2,\ldots,n\}$ with an order-preserving mapping, in $O(n\log n)$ time (by e.g., sorting the sequence elements, and relabeling by the ranks in the order of distinct values). We then store, at every index $i$ of the input sequence, the value $\mathtt{LLIS}[i]$ defined as the length of the longest strictly increasing subsequence ending at $i$ and using the $i$ -th symbol. The values $\mathtt{LLIS}[i]$ can be computed by dynamic programming, by storing all previous key-value pairs $(s_{j},\mathtt{LLIS}[j])$ in a search tree $\mathcal{T}$ as in Lemma 3, and querying $\mathcal{T}.\mathsf{RMaxQ}(0,s_{i}-1)$ .
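The steps above can be sketched in Python as follows. This is an illustration under assumptions, not the algorithm of [14] verbatim: a Fenwick tree storing prefix maxima is swapped in for the search tree of Lemma 3, which suffices here because every query has the form $\mathsf{RMaxQ}(0,s_{i}-1)$. The function name `lis_length` is hypothetical, and only the length is returned (no traceback).

```python
def lis_length(seq):
    """Length of a longest strictly increasing subsequence of seq."""
    # Order-preserving mapping of the alphabet to ranks 1..n.
    ranks = {v: r + 1 for r, v in enumerate(sorted(set(seq)))}
    n = len(ranks)
    tree = [0] * (n + 1)  # Fenwick tree over ranks, storing prefix maxima

    def query(r):
        """Maximum LLIS value among keys 1..r (0 if there are none)."""
        best = 0
        while r > 0:
            best = max(best, tree[r])
            r -= r & -r
        return best

    def update(r, value):
        while r <= n:
            tree[r] = max(tree[r], value)
            r += r & -r

    best = 0
    for s in seq:
        r = ranks[s]
        llis = query(r - 1) + 1  # best chain ending in a strictly smaller value
        update(r, llis)
        best = max(best, llis)
    return best
```

On the example sequence $1,4,2,3,7,5,6$ this sketch returns 5, the length of $1,2,3,5,6$.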

Consider the following extension of the LIS problem to a labeled DAG $G=(V,E,\ell,\Sigma)$ of width $k$ . For a path $P=(v_{1},\dots,v_{t})$ in $G$ , let the label of $P$ , denoted $\ell(P)$ , be the concatenation of the labels of the nodes of $P$ , namely $\ell(v_{1})\cdots\ell(v_{t})$ . Among all paths $P$ in $G$ , and among all subsequences of $\ell(P)$ , we need to find a longest strictly increasing subsequence.

We now explain how to extend the previous dynamic programming algorithm for this problem. We analogously map $\Sigma$ to a subset of $\{1,2,\ldots,|V|\}$ with an order-preserving mapping in $O(|V|\log|V|)$ time, as above. Recall that we assume $V=\{1,\dots,|V|\}$ , where $1,\dots,|V|$ is a topological order. Assume also that we have $K$ paths to cover $V$ and $\mathtt{forward}[u]$ is computed for all $u\in V$ .

For each node $v$ , we aim to analogously compute $\mathtt{LLIS}[v]$ as the length of a longest strictly increasing subsequence of the labels of all paths ending at $v$ , with the property that $\ell(v)$ is the last element of this subsequence.

For each $i\in[1..K]$ , we let $\mathcal{T}_{i}$ be a search tree as in Lemma 3, initialized with key-value pairs $(0,0),(1,-\infty),(2,-\infty),\ldots,(|V|,-\infty)$ . The algorithm proceeds in the fixed topological ordering. Assume now that we are at some position $u$ , and have already updated all search trees associated with the covering paths going through $u$ . For every $(v,i)\in\mathtt{forward}[u]$ , we update $\mathtt{LLIS}[v]=\max(\mathtt{LLIS}[v],\mathcal{T}_{i}.\mathsf{RMaxQ}(0,\ell(v)-1)+1)$ . Once the algorithm reaches $v$ in the topological ordering, value $\mathtt{LLIS}[v]$ has been updated from all $u^{\prime}$ such that $(v,i)\in\mathtt{forward}[u^{\prime}]$ . It remains to show how to update each $\mathcal{T}_{i}$ when reaching $v$ , for all covering paths $i$ on which $v$ occurs. This is done as $\mathcal{T}_{i}.\mathsf{update}(\ell(v),\mathtt{LLIS}[v])$ . Initialization is handled by the $(0,0)$ key-value pair so that any position can start a new increasing subsequence. Figure 4 shows an example.

The final answer to the problem is $\max_{v\in V}\mathtt{LLIS}[v]$ , with the actual LIS to be found with a standard traceback. The algorithm runs in $O(K|V|\log|V|)$ time. With Theorem 2.1 plugged in, we have $K=k$ and the total running time becomes $O(k|E|\log|V|+k|V|\log|V|)=O(k|E|\log|V|)$ , under our assumption $|E|\geq|V|-1$ . The following theorem summarizes this result.

Theorem 0.A.1

Let $G=(V,E,\ell,\Sigma)$ be a labeled DAG of width $k$ , where $\Sigma$ is an ordered alphabet. We can find a longest increasing subsequence in $G$ in time $O(k|E|\log|V|)$ .
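The dynamic program behind this theorem can be sketched in Python as below. Several points are assumptions of the sketch rather than statements of the paper: the path cover is passed in by the caller (the paper obtains it via Theorem 2.1), nodes are numbered `0..n-1`, the forward propagation links are derived from a simple $O(K|E|)$ table of last reaching positions, Fenwick trees with prefix maxima replace the search trees $\mathcal{T}_{i}$ , and $\mathtt{LLIS}[v]$ starts at 1 instead of using the $(0,0)$ key. The name `dag_lis_length` is hypothetical.

```python
from collections import defaultdict, deque

def dag_lis_length(n, edges, labels, paths):
    """Length of a longest strictly increasing subsequence over all paths of a DAG."""
    succ = defaultdict(list)
    indeg = [0] * n
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    topo, queue = [], deque(v for v in range(n) if indeg[v] == 0)
    while queue:  # Kahn's algorithm
        u = queue.popleft()
        topo.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    k = len(paths)
    index = [[-1] * k for _ in range(n)]  # position of v on path i, or -1
    for i, path in enumerate(paths):
        for p, v in enumerate(path):
            index[v][i] = p
    # last[v][i]: position on path i of the last node reaching v (non-empty path).
    last = [[-1] * k for _ in range(n)]
    for u in topo:
        for v in succ[u]:
            for i in range(k):
                last[v][i] = max(last[v][i], last[u][i], index[u][i])
    forward = defaultdict(list)  # forward[u]: links (v, i) with u last on path i to reach v
    for v in range(n):
        for i in range(k):
            if last[v][i] >= 0:
                forward[paths[i][last[v][i]]].append((v, i))
    rank = {x: r + 1 for r, x in enumerate(sorted(set(labels)))}
    m = len(rank)
    trees = [[0] * (m + 1) for _ in range(k)]  # one prefix-maximum Fenwick tree per path

    def query(tree, r):
        best = 0
        while r > 0:
            best = max(best, tree[r])
            r -= r & -r
        return best

    def update(tree, r, value):
        while r <= m:
            tree[r] = max(tree[r], value)
            r += r & -r

    llis = [1] * n  # every node can start a new increasing subsequence
    for u in topo:
        for i in range(k):
            if index[u][i] >= 0:
                update(trees[i], rank[labels[u]], llis[u])
        for v, i in forward[u]:
            llis[v] = max(llis[v], query(trees[i], rank[labels[v]] - 1) + 1)
    return max(llis, default=0)
```

When the DAG is a single labeled path and the cover consists of that one path, the sketch degenerates to the sequence algorithm described earlier in this appendix.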

When the DAG is just a labeled path with $|E|=|V|-1$ (modeling the standard LIS problem), then the algorithm from Lemma 1 returns one path ( $K=1$ ). The complexity is then $O(|V|\log|V|)$ , matching the best possible bound for the standard LIS problem [14].

Appendix 0.B Co-linear chaining with path overlaps

We now consider how to extend the algorithms we developed for Problem 3 to work for the more general case of Problem 2, where overlaps between paths are allowed in a solution. The detection and merging of such path overlaps has been studied in [34], and we tailor a similar approach for our purposes.

We use an FM-index [13] tailored for large alphabets [19], and a two-dimensional range search tree [6] modified to support range maximum queries. The former is used for obtaining all ranges $[i^{\prime}..i]$ in the coverage array $C$ such that all input pairs $M[i^{\prime}],\ldots,M[i]$ have a path $M[i^{\prime\prime}].P$ , $i^{\prime}\leq i^{\prime\prime}\leq i$ , overlapping with the path $M[j].P$ of the $j$ -th input pair $M[j]$ . Here, the endpoint of the $j$ -th input pair is at node $v$ visited in topological order. This implies that the paths of the input pairs $[i^{\prime}..i]$ have already been visited, and thus, by induction, that the values $C[i^{\prime}..i]$ have been correctly computed (subject to the modification we are about to study). The sequence ranges $[M[i^{\prime\prime}].c..M[i^{\prime\prime}].d]$ for all $i^{\prime\prime}\in[i^{\prime}..i]$ may be arbitrarily located with respect to the interval $[M[j].c..M[j].d]$ , so we need to maintain an analogous mechanism with search trees of type $\mathcal{T}$ and $\mathcal{I}$ as in our co-linear chaining algorithm based on a path cover. This time we cannot separate the input pairs in advance into $K$ paths with different search trees; instead we have a dynamic setting, with the interval $[i^{\prime}..i]$ deciding which values should be taken into account. This is where a two-dimensional range search tree is used to support these queries in $O(\log^{2}N)$ time; Figure 5 illustrates this.

In what follows we show that $O(L)$ queries are sufficient to take all overlaps into account throughout the algorithm execution (holding for both the trivial algorithm and for the one based on a path cover), where $L=\sum_{i}|M[i].P|$ —the sum of the path lengths—is at most the total input length. The construction will actually induce an order for the input pairs such that $O(L)$ queries are sufficient: Since the other parts of the algorithms do not use the order of input pairs directly, we can safely reorganize the input accordingly.

With this introduction, we are ready to consider how all the intervals $[i^{\prime}..i]$ related to the $j$ -th pair are obtained. We build in $O(L\log\log|V|)$ time the FM-index version proposed in [19] of the sequence $T=(\prod_{i}\#(M[i].P)^{-1})\#$ , where $\#$ is a symbol not in the alphabet $\{1,2,\ldots,|V|\}$ and considered smaller than all other symbols, e.g. $\#=0$ , and $X^{-1}$ denotes the reverse $X[|X|]X[|X|-1]\cdots X[1]$ of $X$ .

For our purposes it is sufficient to know that the FM-index of $T$ , when given an interval $I(X)$ corresponding to lexicographically-ordered suffixes that start with $X$ , can determine the interval $I(cX)$ in $O(\log\log|V|)$ time [19]. This operation is called backward step.

We use the index to search $M[j].P$ in the forward direction by searching its reverse with backward steps. Suppose we have found the interval $I((M[j].P[1..k])^{-1})$ , for some $k$ , such that the backward step $I(\#(M[j].P[1..k])^{-1})$ results in a non-empty interval $[i^{\prime}..i]$ . This interval $[i^{\prime}..i]$ corresponds to all suffixes of $T$ that have $\#(M[j].P[1..k])^{-1}$ as a prefix. That is, $[i^{\prime}..i]$ corresponds to the input pairs whose path suffix has a length- $k$ overlap with the path prefix of the $j$ -th input pair. For this interval to match the coverage array $C$ , we just need to rearrange the input pairs according to their order in the first $N$ rows of the array storing the lexicographic order of suffixes of $T$ .

Since each backward step on the index may induce a range search on exactly one interval $[i^{\prime}..i]$ , the running time is dominated by the range queries. (Footnote 2: A simple wavelet tree based FM-index would provide the same bound, but in case the range search part is later improved, we used the best bound for the subroutine.) On the other hand, this also gives the bound $L$ on the number of range queries, as claimed earlier.

Alternatively, one can omit the expensive range queries and process each overlapping pair separately, to compute in constant time its contribution to $C[j]$ . This gives another bound $O(L\log\log|V|+\mathtt{\#overlaps})$ , where $\mathtt{\#overlaps}$ is the number of overlaps between the input paths. This can be improved to $O(L+\mathtt{\#overlaps})$ by using a generalized suffix tree to compute the overlaps in advance [34, proof of Theorem 2].

The result is summarized in Theorem 5.4.

Bibliography


[1] Mohamed Ibrahim Abouelhoda. A Chaining Algorithm for Mapping cDNA Sequences to Multiple Genomic Sequences. In 14th International Symposium on String Processing and Information Retrieval, volume 4726 of LNCS, pages 1–13. Springer, 2007.
[2] Ravindra K. Ahuja, Thomas L. Magnanti, and James B. Orlin. Network Flows: Theory, Algorithms, and Applications. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1993.
[3] Amihood Amir, Moshe Lewenstein, and Noa Lewenstein. Pattern matching in hypertext. J. Algorithms, 35(1):82–99, 2000.
[4] Djamal Belazzougui. Linear time construction of compressed text indices in compact space. In Proc. Symposium on Theory of Computing STOC 2014, pages 148–193. ACM, 2014.
[5] Djamal Belazzougui, Fabio Cunial, Juha Kärkkäinen, and Veli Mäkinen. Versatile Succinct Representations of the Bidirectional Burrows-Wheeler Transform. In Proc. 21st Annual European Symposium on Algorithms (ESA 2013), volume 8125 of LNCS, pages 133–144. Springer, 2013.
[6] Mark de Berg, Otfried Cheong, Marc van Kreveld, and Mark Overmars. Computational Geometry: Algorithms and Applications. Springer-Verlag TELOS, Santa Clara, CA, USA, 3rd edition, 2008.
[7] Y. Chen and Y. Chen. An Efficient Algorithm for Answering Graph Reachability Queries. In 2008 IEEE 24th International Conference on Data Engineering, pages 893–902, April 2008.
[8] Y. Chen and Y. Chen. On the graph decomposition. In 2014 IEEE Fourth International Conference on Big Data and Cloud Computing, pages 777–784, Dec 2014.