BOAssembler: a Bayesian Optimization Framework to Improve RNA-Seq   Assembly Performance

Shunfu Mao; Yihan Jiang; Edwin Basil Mathew; Sreeram Kannan

arXiv:1902.05235·q-bio.GN·February 15, 2019·AlCoB

BOAssembler: a Bayesian Optimization Framework to Improve RNA-Seq Assembly Performance

Shunfu Mao, Yihan Jiang, Edwin Basil Mathew, Sreeram Kannan

PDF

Open Access 1 Repo

TL;DR

BOAssembler is a Bayesian Optimization framework that automates hyper-parameter tuning for RNA-Seq assembly tools, significantly enhancing their performance and reliability across diverse datasets.

Contribution

It introduces an end-to-end automatic tuning method for RNA-Seq assemblers using Bayesian Optimization, addressing the challenge of manual hyper-parameter tuning.

Findings

01

Improved assembly performance across multiple datasets

02

Automated tuning reduces time and effort for users

03

Potential to enhance downstream biological analyses

Abstract

High throughput sequencing of RNA (RNA-Seq) can provide us with millions of short fragments of RNA transcripts from a sample. How to better recover the original RNA transcripts from those fragments (RNA-Seq assembly) is still a difficult task. For example, RNA-Seq assembly tools typically require hyper-parameter tuning to achieve good performance for particular datasets. This kind of tuning is usually unintuitive and time-consuming. Consequently, users often resort to default parameters, which do not guarantee consistent good performance for various datasets. Here we propose BOAssembler (https://github.com/olivomao/boassembler), a framework that enables end-to-end automatic tuning of RNA-Seq assemblers, based on Bayesian Optimization principles. Experiments show this data-driven approach is effective to improve the overall assembly performance. The approach would be helpful for…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

olivomao/boassembler
noneOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMachine Learning and Algorithms · Machine Learning and Data Classification · Genomics and Phylogenetic Studies

Full text

\access

Advance Access Publication Date: Day Month Year \appnotesManuscript Category

\corresp

∗ Equal contribution, to whom correspondence should be addressed.

\history

Received on XXXXX; revised on XXXXX; accepted on XXXXX

\editor

Associate Editor: XXXXXXX

BOAssembler: a Bayesian Optimization Framework to Improve RNA-Seq Assembly Performance

Genome Analysis

Shunfu Mao ${}^{\text{\sfb 1}*}$

Yihan Jiang ${}^{\text{\sfb 1}*}$

Edwin Basil Mathew ${}^{\text{\sfb 2}}$ and Sreeram Kannan ${}^{\text{\sfb 1}}$

${}^{\text{\sf 1}}$ Department of Electrical and Computer Engineering, University of Washington, Seattle, USA

${}^{\text{\sf 2}}$ Data Science Program, University of Washington, Seattle, USA

(2015; 2015)

Abstract

Motivation: High throughput sequencing of RNA (RNA-Seq) can provide us with millions of short fragments of RNA transcripts from a sample. How to better recover the original RNA transcripts from those fragments (RNA-Seq assembly) is still a difficult task. For example, RNA-Seq assembly tools typically require hyper-parameter tuning to achieve good performance for particular datasets. This kind of tuning is usually unintuitive and time-consuming. Consequently, users often resort to default parameters, which do not guarantee consistent good performance for various datasets.

Results: Here we propose BOAssembler, a framework that enables end-to-end automatic tuning of RNA-Seq assemblers, based on Bayesian Optimization principles. Experiments show this data-driven approach is effective to improve the overall assembly performance. The approach would be helpful for downstream (e.g. gene, protein, cell) analysis, and more broadly, for future bioinformatics benchmark studies.

Availability: \hrefhttps://github.com/olivomao/boassemblerhttps://github.com/olivomao/boassembler

Contact: \[email protected]@uw.edu, \[email protected] [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

1 Introduction

Sequence assembly is a process to recover the original genomic sequences from their sampled reads. Based on sequencing technology (DNA/RNA) and model genome availability, there are different assembly problems. In this study, we focus on reference-based RNA-Seq assembly, which is the first step to understand gene, protein and cell functions.

Existing reference-based RNA-Seq assemblers (such as Cufflinks Cufflinks , Stringtie Stringtie ) usually align reads onto reference genome first, and based on the alignment build a graph where each node represents a genome region (exon) and each edge represent two node regions are aligned by some reads. They then traverse the graph to find paths as recovered RNA transcripts.

The assembly problem is essentially NP-hard kececioglu_combinatorial_1995 and existing tools resort to heuristic methods. For example, from the graph, Stringtie will extract heaviest paths iteratively. These methods usually require parameter tuning to achieve good performance for particular datasets. Since most users may not understand the meaning of the parameters well and tuning is tedious and time-consuming, they usually end up with default settings. An automatic tuning framework, therefore, is necessary.

In machine learning (ML), Bayesian Optimization (BO) is gaining a surge of interest as its usefulness in tuning hyper-parameters for modern deep learning systems Snoek:2012 . BO is favorable for optimizing objective functions that are expensive to evaluate and are over continuous domains of less than 20 dimensions Frazier2018 . BO has become widely used in most deep learning systems such as Natural Language Processing (NLP) bo_nlp , Reinforcement Learning (RL) bo_gaming , and Channel Coding bo_cc . Depending on algorithms and programming languages, several popular BO packages have been developed, such as GPyOpt gpyopt2016 .

There are limited work to introduce BO into computational biology fields. Recently Quitadadmo2017 applies BO to improve eQTL analysis. To the best of our knowledge, no work has introduced BO to assembly tasks yet, which are fundamentally graph problems with their own unique challenges. To fill this gap, we have developed BOAssembler, which is a framework able to incorporate existing assemblers (such as Stringtie) and BO methods (such as GPyOpt) to assist assembler developers and biologists to spend minimal efforts to have the assemblers’ hyper-parameters automatically fine tuned for particular datasets.

Our contribution as follows:

•

We firstly explore the BO methods in (reference-based RNA-Seq) assembly tasks.

•

Our designed experiments show that BO is overall effective to improve assembly.

•

An open source end-to-end framework (BOAssembler) is provided for the assembly community to use.

2 Methods

In this section we first introduce assemblers, and then explain Bayesian Optimization, and finally describe our BOAssembler framework that combines both.

2.1 Assembler

There are two kinds of RNA-Seq assembly problems: de novo assembly and reference-based assembly. For de novo assembly, we only have RNA-Seq reads, which is common in non-model organisms. For reference-based assembly, there is additional knowledge on the genome of the organism. De novo assembly is appearantly more challenging and typical tools (such as Trinity Grabherr2011 and recently Shannon Kannan2016 ) require much more computational resources and more complicated evaluations. As a first step to bridge assembly and BO, we focus on reference-based RNA-Seq assembler. In particular, we focus on the widely used Stringtie, as recommended in Hayer2015 .

A typical reference-based RNA-Seq assembly includes aligning sampled RNA-Seq reads onto a reference genome using external tools such as STAR Dobin2012 etc. For Stringtie, a (splice) graph will be prepared where each node represents a unique exonic region supported by aligned reads and edges indicate how nodes are bridged by reads. Graph traversal algorithms will be applied to find paths as transcripts to best explain the constraints from graph nodes and edges.

Since assembly problems are NP hard kececioglu_combinatorial_1995 , existing algorithms take a lot of heuristics (bunch of thresholds), assembler performance depends heavily on hyper-parameters. For example in Stringtie, parameter ’-f’ sets a fractional threshold below which predicted transcripts will be discarded, and a lower value encourages transcripts to be retained to improve sensitivity.

Developers of assemblers typically tune parameters by intuition on a few datasets, and offer default parameters for assembler users to use. As assembler’s performance for various datasets are usually parameter dependent, a more systematic method of tuning parameter is needed. Parameters are continous, and not of low dimensions, which makes grid search on all possible combination prohibitive. Random search bergstra:2012 are expensive to guarantee good coverage, which is not favorable to tune parameter for assembler. We propose BO based method, which is a systematic parameter tuning method with limited number of evaluations in the following part.

2.2 Bayesian Optimization

The reference-based assembler together with its evaluation can be represented as an abstract function $f(D,\theta)$ , where $D$ includes both read alignment for assembly and reference transcriptome (a set of ground truth RNA transcripts) for evaluation, and $\theta$ is the parameters of the abstract function with parameters of dimension $d$ . After read alignments are assembled with given parameter $\theta$ , the assembly output (a set of RNA transcripts) will be compared with the reference transcriptome, and the quality of assembly is measured with scalar metrics such as precision $p$ and sensitivity $s$ . $f(D,\theta)$ outputs a score based on $p$ and $s$ . Our goal is to find a global optimal $\theta$ which maximizes $f(D,\theta)$ , with limited evaluations of $f(D,\theta)$ since running assembler is time consuming.

BO aims at maximizing a real-value black-box function $f(\theta)$ with respect to $\theta$ frazier:2018 in a gradient-free approach (here $D$ is fixed). BO consists of a statistical surrogate objective function to model the input-output relationship between $\theta$ and $f(\theta)$ , and an acquisiton function to decide what to sample next. Firstly BO evaluates randomly chosen $K$ datapoints of $\theta$ , and fits the prior statistical objective model. Then BO iteratively updates the posterior model with newly acquired $f(\theta_{k})$ , and selects $\theta_{k+1}$ to evaluate according to posterior. BO is a systematic approach to explore the parameter space according to a Bayesian model with limited allowed evaluations.

Gaussian Process (GP) with Matern Kernel Snoek:2012 is a natural model for statistical objective function. Expected Improvement (EI) is a commonly used acquisition function. Our BOAssembler uses GP with Matern Kernel and EI as our primary BO method. For details, please refer to Supplementary Section 2. The procedure of iterative update, based on GP and EI, is described in Algorithm 1.

2.3 Combine BO and Assembler

2.3.1 The Motivation to Combine

Bayesian optimization works well for black-box gradient-free global optimization with moderate dimensionality. It is favorable to apply BO to optimize assembler parameters due to the following reasons:

•

Empirically BO works well for parameters with moderate dimensionality (less than 20). This is consistent with assemblers which typically have moderate number of parameters. The usage of BO inherently assume the parameters form a Gaussian Process.

•

$f(.)$ is continuous, and the parameter $\theta$ are correlated, and has well-defined feasible set. This meets the requirements of BO.

•

The function $f(.)$ is expensive to evaluate, thus to evaluate all possible combinations of parameters is prohibitive.

•

$f(.)$ is a ’black-box’, while gradient-based optimization methods cannot be applied. Assemblers typically do not have a gradient due to usage of thresholds, which makes black-box method favorable.

A major drawback of BO is its inference time grows cubically with respect to number of iterations snoek:2015 . For assembler the scalability issue is not severe, since we observe convergence typically at $40$ to $50$ iterations, and the major bottleneck is the assembly part.

2.3.2 The Architecture

Figure 1 illustrates the overall architecture of BOAssembler. There are two parts: the assembly part (e.g. $f(D,\theta)$ ) and BO part.

The assembly part wraps up the RNA-Seq reference-based assembler (here Stringtie), which takes fixed read alignment as well as adjustable assembler parameters as input, and outputs assembled RNA transcripts (in gtf format). In addition, the assembly part includes an evaluator block to access the assembly output. Basically, it calls the gffcompare111https://ccb.jhu.edu/software/stringtie/gffcompare.shtml tool, which takes as input the assembly output and reference transcriptome, and outputs sensitivity and precision statistics. The sensitivity is the percentage of reference RNA transcripts that have been correctly recovered, and the precision means the percentage of assembled transcripts that correctly match the reference transcriptome. We further combine the sensitivity and precision (such as F1 score) as $f(D,\theta)$ , to be used by the BO part. The evaluator may also take adjustable parameters as discussed in Section 2.3.3.

The BO part has its theory described in Section 2.2. More concretely, it wraps up existing BO methods (such as GPyOpt). It treats the assembly part as a black box, where the input to the box is the parameters for assembler and evaluator, and the output of the box is the combined performance metric for the assembler (such as F1 score). The BO part will iteratively optimize the parameters for the black box function (e.g. the assembly part) based on the feedback of performance metric.

2.3.3 Metrics to optimize

The assembly part outputs $f(D,\theta)$ , which is a metric score and serves as an input to BO part. In particular, it is defined as a weighted F1 score ( $S_{w}=\frac{\lambda p\times(1-\lambda)s}{\lambda p+(1-\lambda)s}$ ) on top of the evaluator’s output in terms of sensitivity $s$ and precision $p$ of the assembly. $\lambda\in(0,0.5)$ is also BO tunable.

There are several candidate metrics including the mean value ( $S_{m}=\frac{s+p}{2}$ ) and the F1 score ( $S_{F1}=\frac{2sp}{s+p}$ ). We found the BO part tends to overfit either $s$ or $p$ towards $1$ when using $S_{m}$ . Though $S_{F1}$ is able to balance $s$ and $p$ , we find $S_{w}$ is better to improve the final performance of sensitivity and precision. In our experiments, $s$ tends to have a lower value range than $p$ (due to many reference RNA transcripts do not have enough coverage), we hope to reward more for sensitivity improvement but still have gain on precision. Therefore, we come up with the weighted F1 score $S_{w}$ , which uses BO to figure out how much percentage we want to reward especially for the improvement of sensitivity.

2.3.4 Hyper-parameters

Hyper-parameters are applied to assembler and to evaluator ( $\lambda$ ). We focus on numerical (int, float) types. Each hyper-parameter has its name (e.g. ’-f’), type (e.g. float), default value (e.g. $0.1$ ) and range (e.g. $(0.0,1.0)$ ). See Supplementary Section 3 for Stringtie’s example.

2.3.5 Usage and Extension of BOAssembler

To tune parameters for an existing assembler in BOAssembler, the user only needs to provide a small sample of read alignment which can be done by our provided scripts (see Supplementary Section 1). After some iterations, BOAssembler will report suggested parameters and its tuning history (e.g. per iteration’s parameter and metrics) in a log file.

BOAssembler currently uses Stringtie as its default assembler. It supports Cufflinks as well. Extension to use other reference-based RNA-Seq assemblers is also straight forward. The user only needs to follow the Stringtie example, to add a line of Python code in a specified Python file, and to add config file according to our pre-defined format as mentioned in Section 2.3.4 to include the parameters to be tuned.

3 Result and Discussion

3.1 Datasets

Our goal is to use BOAssembler to tune assembler’s hyper-parameters on a smaller dataset, and apply recommended hyper-parameters on a large assembly task. Since the smaller dataset has representitive data of large assembly task, we expect tuned hyper-parameters can overall improve the large assembly task in terms of sensitivity and precision.

We build our results based on simulated datasets, since real datasets lack ground truth and it is hard to judge if an assembled RNA transcript is a false positive, or a new RNA transcript that has yet to be discovered. The simulated datasets are generated based on real ones.

Firstly we prepare three real datasets, including: 132.05M Illumina single end reads (50-bp) sampled from human embryonic stem cells (HESC) (GSE51861, used in ww_data ), 115.36M Illuminar pair end reads (101-bp) sampled from Lymphoblastoid cells (LC) (SRP036136, used in snyder_data ), and 183.53M Illuminar pair end reads (100-bp) sampled from HEK293T (Kidney) cells (SRX541227), previously produced and studied in StringTie Stringtie .

Secondly, we use RSEM Li2011 to generate simulated reads from real datasets. To begin with, we choose LC reference transcripts (containing $207266$ RNA transcripts) as the ground truth reference transcriptome annotations. We then do quantification of real datasets using RSEM and get learned statistics from real datasets. Based on learned statistics, we use RSEM to sample simulated reads from ground truth reference transcriptome. The simulated HESC has 150M 50-bp single-end reads, simulated LC has 150M 101-bp pair end reads and simulated Kidney has 150M 100-bp pair end reads.

Lastly, we use STAR Dobin2012 (2-pass strategy) to align three simulated datasets onto human reference genome (hg19)222http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/. From each alignement (in bam format), we subsample to get smaller alignment files of chromosome15 as fixed datasets for BOAssembler. The small datasets are about $1.5\%$ , $3.1\%$ , and $2.1\%$ of large datasets for HESC, LC and Kidney respectively. We’ve proposed another more complicated sampling method across chromosomes, which offer similar performance as discussed in Supplementary Section 1.

3.2 Procedure

For each dataset, we run BOAssembler on the smaller datasets. The evaluation for metric also uses a subset (e.g. chromosome 15) of reference transcriptome. Each iteration takes around 1 minute, and we typically see convergence of metric score around 40 to 50 iterations. Compared to grid search for possible combinations of 10 to 20 parameters, BOAssembler is much more efficient.

After automatic tuning, BOAssembler will recommend parameters with high metric scores. We then apply these parameters on large datasets, which typically take several hours to finish the assembly tasks using 25 cores of a linux server.

3.3 Experiment Results

Table 3.3 compares the performance of default parameters (Def) and BOAssembler-tuned parameters (Tune) for each simulated dataset, in terms of sensitivity (Sens), precision (Prec). We also list their standard F1 score here since it’s related to the metric BOAssembler tries to optimize. But we’ll focus on sensitivity and precision which are of practical interest.

As Table 3.3 shows, BOAssembler has improved sensitivity, and precision for all small datasets. In particular, HESC small is improved by $16.9\%$ in sensitivity and $27.4\%$ in precision, LC samll is improved by $1.2\%$ in sensitivity and $13.1\%$ in precision, Kidney small is improved by $3.2\%$ in sensitivity and $5.5\%$ in precision. Notice that the real Kidney dataset has been used in Stringtie’s original work, so the default parameters of Stringtie should have been adjusted for this dataset statistics. Still BOAssembler improves the its performance further.

The trend of performance improvement is mostly reflected in assembly tasks on large datasets, which is most interesting to us. In particular, HESC large is improved by $6.5\%$ in sensitivity and $32.2\%$ in precision, Kidney large is improved by $0.3\%$ in sensitivity and $0.6\%$ in precision. LC large has a small loss around $1.2\%$ in sensitivity, but it gains $10\%$ , which is significant, in precision. The experiments show that by tuning hyper-parameters through BOAssembler on small datasets, we are able to improve large assembly tasks overall (though there could be fluctuations) to a smaller extent.

The diminished performance gain of tuned parameters on large datasets, compared to the gain on small ones, may be because of an averaging effects across more variant alignment statistics in large datasets. To better catch up large dataset statistics, we have also prepared small datasets selected from certain regions, the performance improvement trend is similar (see Supplementary Section 1).

By comparing the BOAssembler suggested parameters with assembler’s default ones, we could also gain more insights into the datasets. For example, in HESC small datasets, the parameter ’f’ is suggested to decrease from $0.1$ to [math], this will allow more transcripts of low expression levels to also be considered as assembly output (hereby improve sensitivity). Meanwhile, the parameter ’m’ is suggested to increase from $200$ to $500$ to allow only longer (e.g. at least $500$ ) assembled transcripts to be considered (hereby improve precision).

3.4 Discussion

We expect our study and developed BOAssembler will contribute to the assembly community as follows:

•

For bioinformaticians who develop assembly algorithms, the framework or ideas behind it could provide them with more convenient ways to set default parameters for their assemblers.

•

For biologists who use reference-based RNA-Seq assemblers, BOAssembler can help them improve assembly performance, so they can gain better insights into the datasets, and the improved assembled RNA transcripts will be helpful for downstream gene, protein and cell related analysis.

•

For benchmark work of assemblers, typically several datasets are prepared and different assemblers are compared for their default parameters. BOAssembler or its ideas will help the benchmark work in a fairer basis, since default parameters can not ganrantee consistent good performance across various datasets.

Whereas this is, to our best knowledge, the first efforts to bring assembly and BO together, there are interesting directions future directions.

•

As from experiments, we have observed that the gain of tuned parameters gets diminished for larger datasets, which implies BO tuned parameter overfits to small training dataset. The idea of using additional validation dataset is shown in Supplementary Section 2.4. Since evaluating assembler is expensive, more efficient data subsampling and cross-validation methods to avoid overfitting are an interesting future direction.

•

Another interesting exploration is how to define a metric score that is better than the current weighted F1 score for Baysian Optimization, to better balance sensitivity and precision.

•

There’re many problems in assembly areas (including variant calling) that heavily relay on hyper-parameter turing for better performance. Introduce similar frameworks to these problems shall be helpful.

Funding

This project is funded by NIH R01 Award 1R01HG008164 by NHGRI and NSF CCF Award 1703403.

Bibliography20

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1(1) Alexander Dobin, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R. Gingeras. Star: ultrafast universal rna-seq aligner. Bioinformatics , 2012.
2(2) Mihaela Pertea, Geo M Pertea, Corina M Antonescu, Tsung-Cheng Chang, Joshua T Mendell, and Steven L Salzberg. Stringtie enables improved reconstruction of a transcriptome from rna-seq reads. Nat Biotech , 33:290–295, 03 2015.
3(3) C. Trapnell, B. A. Williams, G. Pertea, A. Mortazavi, G. Kwan, M. J. van Baren, S. L. Salzberg, B. J. Wold, and L. Pachter. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. , 28(5):511–515, May 2010.
4(4) Manfred G Grabherr, Brian J Haas, Moran Yassour, Joshua Z Levin, Dawn A Thompson, Ido Amit, Xian Adiconis, Lin Fan, Raktima Raychowdhury, Qiandong Zeng, Zehua Chen, Evan Mauceli, Nir Hacohen, Andreas Gnirke, Nicholas Rhind, Federica di Palma, Bruce W Birren, Chad Nusbaum, Kerstin Lindblad-Toh, Nir Friedman, and Aviv Regev. Full-length transcriptome assembly from RNA-seq data without a reference genome. Nature Biotechnology , 29(7):644–652, may 2011.
5(5) Sreeram Kannan, Joseph Hui, Kayvon Mazooji, Lior Pachter, and David Tse. Shannon: An information-optimal de novo rna-seq assembler. bio Rxiv , 2016.
6(6) Katharina E. Hayer, Angel Pizarro, Nicholas F. Lahens, John B. Hogenesch, and Gregory R. Grant. Benchmark analysis of algorithms for determining and quantifying full-length m RNA splice forms from RNA-seq data. Bioinformatics , page btv 488, sep 2015.
7(7) J. D. Kececioglu and E. W. Myers. Combinatorial algorithms for DNA sequence assembly. Algorithmica , 13(1-2):7–51, February 1995
8(8) Frazier PI. A tutorial on Bayesian optimization. ar Xiv preprint ar Xiv:1807.02811. 2018 Jul 8.