A practical fpt algorithm for Flow Decomposition and transcript assembly

Kyle Kloster; Philipp Kuinke; Michael P. O'Brien; Felix Reidl; Fernando Sánchez Villaamil; Blair D. Sullivan; and Andrew van der Poel

arXiv:1706.07851·cs.DS·August 31, 2017

TL;DR

This paper introduces a practical fixed-parameter tractable algorithm for Flow Decomposition on DAGs, a computational step in transcript assembly, and shows that its implementation finds exact solutions efficiently on RNA-sequence data.

Contribution

The paper presents a practical linear fpt algorithm for Flow Decomposition parameterized by the number of paths, along with an engineered implementation that finds exact solutions with runtimes competitive with a state-of-the-art heuristic for transcript assembly.

Findings

01

The solver finds exact solutions efficiently on RNA-sequence data.

02

The solver's runtime is competitive with a state-of-the-art heuristic.

03

$k$-Flow Decomposition admits no polynomial kernel under standard complexity assumptions, and the related problem of assigning known weights to a given set of paths is NP-hard.

Abstract

The Flow Decomposition problem, which asks for the smallest set of weighted paths that "covers" a flow on a DAG, has recently been used as an important computational step in transcript assembly. We prove the problem is in FPT when parameterized by the number of paths by giving a practical linear fpt algorithm. Further, we implement and engineer a Flow Decomposition solver based on this algorithm, and evaluate its performance on RNA-sequence data. Crucially, our solver finds exact solutions while achieving runtimes competitive with a state-of-the-art heuristic. Finally, we contextualize our design choices with two hardness results related to preprocessing and weight recovery. Specifically, $k$-Flow Decomposition does not admit polynomial kernels under standard complexity assumptions, and the related problem of assigning (known) weights to a given set of paths is NP-hard.
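To make the notion of weighted paths "covering" a flow concrete, the following is a minimal sketch that checks whether a candidate set of weighted paths reproduces a given flow arc by arc. It is a verifier only, not the fpt algorithm from the paper; the function names `flow_from_paths` and `is_decomposition`, the dictionary representation of the flow, and the toy instance are all illustrative assumptions.

```python
from collections import defaultdict


def flow_from_paths(paths, weights):
    """Sum the weights of all paths using each arc.

    paths   -- list of node sequences, e.g. ["s", "a", "t"] (assumed format)
    weights -- list of path weights, same length as paths
    Returns a dict mapping each arc (u, v) to its accumulated weight.
    """
    flow = defaultdict(int)
    for path, weight in zip(paths, weights):
        # Consecutive node pairs along the path are its arcs.
        for arc in zip(path, path[1:]):
            flow[arc] += weight
    return dict(flow)


def is_decomposition(flow, paths, weights):
    """Check that the weighted paths reproduce the flow exactly on every arc."""
    nonzero = {arc: value for arc, value in flow.items() if value != 0}
    return flow_from_paths(paths, weights) == nonzero


# Hypothetical toy instance: a flow of value 5 from s to t on a small DAG.
flow = {("s", "a"): 5, ("a", "t"): 3, ("a", "b"): 2, ("b", "t"): 2}
paths = [["s", "a", "t"], ["s", "a", "b", "t"]]
weights = [3, 2]

print(is_decomposition(flow, paths, weights))
```

On this toy instance two paths suffice; swapping the two weights breaks the arc-wise equality, which hints at why recovering weights for a fixed set of paths is a problem in its own right.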

Tables

Table 1. Summary of the RNA sequencing dataset [15]. Statistics are for non-trivial instances. Columns 4 through 6 give averages; column 7 (8) reports the percent of non-trivial ground truth decompositions that are optimal (non-optimal) size. Because Toboggan timed out on some instances, these percentages do not sum to 100. The high percentage of instances with ground truth of minimum size supports the use of Flow Decomposition as a model for transcript assembly.

dataset	instances	non-trivial	avg. nodes	avg. degree	avg. $k$	optimal	non-optimal
zebrafish	1,549,373	445,880	18.49	2.27	2.32	99.91%	0.053%
mouse	1,316,058	473,185	18.67	2.37	2.75	99.40%	0.074%
human	1,169,083	529,523	18.79	2.41	2.83	99.49%	0.043%
human-salmon	13,300,893	7,153,472	20.54	2.55	3.74	94.39%	0.035%
all	17,335,407	8,602,060	20.22	2.52	3.55	95.27%	0.039%

Equations

$$ f(a) = \sum_{i=1}^{k} w_{i} \cdot 1_{p_{i}}(a) $$

$$ \sum_{i \in {g'}^{-1}(a)} w_{i} = f(a). $$

$$ \sqrt{2 \pi k} \left(\frac{k}{e}\right)^{k} e^{1/(12k+1)} \leq k! \leq \sqrt{2 \pi k} \left(\frac{k}{e}\right)^{k} e^{1/(12k)}. $$

$$ \begin{aligned} \left\{ {k \atop \ell} \right\} \ell! &\leq \frac{\sqrt{k}}{2\sqrt{k-\ell}} \left(\frac{k}{e}\right)^{k} \left(\frac{e}{k-\ell}\right)^{k-\ell} e^{\frac{1}{12k} - \frac{1}{12(k-\ell)+1}} \frac{\ell^{k}}{\ell^{\ell}} \\ &\leq \sqrt{k}\, \frac{k^{k} \ell^{k-\ell}}{(k-\ell)^{k-\ell} e^{\ell}}. \end{aligned} $$

$$ \sqrt{k}\, \frac{(\alpha k)^{k-\alpha k}}{(k-\alpha k)^{k-\alpha k} e^{\alpha k}} k^{k} \leq \sqrt{k} \left(\left(\frac{\alpha}{1-\alpha}\right)^{(1-\alpha)} e^{-\alpha}\right)^{k} k^{k} \leq \sqrt{k}\, (0.649 k)^{k}, $$

$$ \frac{4^{k^{2}}}{k!\, k^{k}} \cdot \sqrt{k}\, (0.649 k)^{k} = \sqrt{k}\, \frac{4^{k^{2}}\, 0.649^{k}}{k!}. $$

$$ \frac{4^{k^{2}} k^{2.5k}\, 0.649^{k}}{k!} k^{o(k)} \cdot (n + \lambda) \leq 4^{k^{2}} k^{1.5k}\, 1.765^{k} k^{o(k)} \cdot (n + \lambda) $$

$$ k \geq |F(C_{1}) \cap F(C_{2})| + \frac{2}{3} \left(|F_{1}| + |F_{2}|\right), $$
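The constant 0.649 in the bounds above comes from bounding the base $(\frac{\alpha}{1-\alpha})^{(1-\alpha)} e^{-\alpha}$ over $\alpha \in (0, 1)$, and 1.765 follows from multiplying by $e$ (using $k! \geq (k/e)^{k}$). The sketch below checks both constants numerically with a plain grid search; the helper names `base` and `max_base` and the grid resolution are assumptions made for illustration, not part of the paper.

```python
import math


def base(alpha):
    """Evaluate (alpha / (1 - alpha))^(1 - alpha) * e^(-alpha) for 0 < alpha < 1."""
    return (alpha / (1.0 - alpha)) ** (1.0 - alpha) * math.exp(-alpha)


def max_base(steps=100_000):
    """Grid-search the maximum of base(alpha) on the open interval (0, 1).

    Returns (best_value, best_alpha). A grid search only approximates the
    true maximum from below, so this is a sanity check rather than a proof.
    """
    return max((base(i / steps), i / steps) for i in range(1, steps))


best_value, best_alpha = max_base()

# The maximum should sit just under 0.649, and e times it just under 1.765.
print(best_value <= 0.649, best_value * math.e <= 1.765)
```

Setting the derivative of the logarithm to zero gives $\ln\frac{\alpha}{1-\alpha} = \frac{1-\alpha}{\alpha}$, whose solution lies near $\alpha \approx 0.638$, which agrees with where the grid search places the maximum.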

Full text

Copyright: Kyle Kloster, Philipp Kuinke, Michael O’Brien, Felix Reidl, Fernando Sánchez Villaamil, Blair D. Sullivan, Andrew van der Poel

Kyle Kloster

North Carolina State University, USA