Even better correction of genome sequencing data

Maciej Dlugosz; Sebastian Deorowicz; Marek Kokot

arXiv:1703.00690·q-bio.GN·March 3, 2017

Even better correction of genome sequencing data

Maciej Dlugosz, Sebastian Deorowicz, Marek Kokot

PDF

Open Access

TL;DR

This paper presents an enhanced version of the RECKONER error correction tool for Illumina genome sequencing data, significantly reducing computation time and introducing a new method for selecting the optimal k-mer length, validated on large genomic datasets.

Contribution

The authors improve RECKONER's workflow for faster correction and propose a novel method for determining the optimal k-mer size in error correction.

Findings

01

10-fold reduction in computation time

02

Effective correction on human and maize genomes

03

Validated on Illumina HiSeq and MiSeq data

Abstract

We introduce an improved version of RECKONER, an error corrector for Illumina whole genome sequencing data. By modifying its workflow we reduce the computation time even 10 times. We also propose a new method of determination of $k$ -mer length, the key parameter of $k$ -spectrum-based family of correctors. The correction algorithms are examined on huge data sets, i.e., human and maize genomes for both Illumina HiSeq and MiSeq instruments.

Figures7

Click any figure to enlarge with its caption.

Tables3

Table 1. Table 1 : Data simulation profiles sources

Accession	Read	Instrument	Average base
number	length	model	error probability
SRR1203044	$100 bp$	HiSeq	2.0 %
SRR1802178	$250 bp$	MiSeq	0.3 %

Table 2. Table 2 : Correctors versions and parameters

BFC	version v1	genome_length	2991110000	2222330000	2991110000
Algorithm	Version	Parameter	Human	Maize	Human
Algorithm	Version	Parameter	HiSeq	HiSeq	MiSeq
RECKONER	1.0	k	36	42	78
RECKONER	0.2.1	k	36	42	78
BLESS	1.02	k	36	42	72
BFC	BFC-ht	genome_length	2991110000	2222330000	2991110000
Musket	1.1	k	33	33	72
Musket	1.1	genome_length	38649008822	29850675652	40692109119

Table 3. Table 3 : Data sizes

Parameter	Human HiSeq	Maize HiSeq	Human MiSeq
Gzipped file size	50 GiB	38.9 GiB	42.9 GiB
Number of reads	598,200,000	444,466,000	239,280,000

Equations4

\begin{array}[]{cclcccl}k&=&(a\log_{2}{g}-b)c,&&a&=&0.8,\\ b&=&9+p,&&c&=&2+\ell/100,\end{array}

\begin{array}[]{cclcccl}k&=&(a\log_{2}{g}-b)c,&&a&=&0.8,\\ b&=&9+p,&&c&=&2+\ell/100,\end{array}

gain = (∣ T P ∣ - ∣ F P ∣) / (∣ T P ∣ + ∣ F N ∣) .

gain = (∣ T P ∣ - ∣ F P ∣) / (∣ T P ∣ + ∣ F N ∣) .

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsGenomics and Phylogenetic Studies · Chromosomal and Genetic Variations · Algorithms and Data Compression

Full text

11institutetext: Institute of Informatics, Silesian University of Technology,

Akademicka 16, 44-100 Gliwice, Poland

{maciej.dlugosz,sebastian.deorowicz,marek.kokot}@polsl.pl

http://sun.aei.polsl.pl/REFRESH/

Even better correction

of genome sequencing data

Maciej Długosz

Sebastian Deorowicz

Marek Kokot

Abstract

We introduce an improved version of RECKONER, an error corrector for Illumina whole genome sequencing data. By modifying its workflow we reduce the computation time even 10 times. We also propose a new method of determination of $k$ -mer length, the key parameter of $k$ -spectrum-based family of correctors. The correction algorithms are examined on huge data sets, i.e., human and maize genomes for both Illumina HiSeq and MiSeq instruments.

1 Introduction

For several years next-generation sequencing (NGS) technologies, like 454 pyrosequencing, Complete Genomics, Illumina, Ion Torrent, Pacific Biosciences [11], and Oxford Nanopore [23] have been dominating the field of DNA analysis. Their advantages over Sanger method [19], that they superseded, include extremely high throughput (even $600\,\text{Gb}$ per instrument run), low sequencing cost (even $41\text{\$ } $per Gb) [[18](#bib.bib18)] and, in case of Oxford Nanopore MinION, virtually domestically low cost ($ 1000\text{$}$) and size of the instruments [23].

Such advantages have enabled many sequencing data applications. High availability of huge amount of data have enormous practical potential. De novo assembly, reassembly, single nucleotide polymorphisms (SNPs) detection, metagenomics, personalized medicine analysis are only the examples of applications variety. Large projects aiming of sequencing thousands of large genomes like the 1000 Genomes Project [21] or the Genome 10K Project [5] became possible.

The price of NGS advantages is relatively low quality of reads (short fragments of sequenced genomes) they produce. This poses a challenge in development of algorithms performing downstream analysis. Some instruments produce short reads, e.g., reads generated by popular Illumina HiSeq2000 of length up to $150\,\text{bp}$ (base pairs) [18] are shorter than many of repeats present in DNA. The distribution of reads over the genome is far from uniform, which causes overrepresentation of some fragments and often unpresence of the other ones [11, 18]. Sequencing errors cause deformations of reads by altering some symbols, inserting alien fragments into reads, or removing their fragments [11].

All of those and other issues have negative impact on utilization of sequencing data. They affect accuracy of algorithms processing the reads, e.g., by generating false positives in SNPs detection or causing false connections between contigs in de novo assembly. The algorithms processing such data are complex to address those problems, have large time complexity and consume a lot of memory.

Newer instruments still reduce sequencing costs and enhance throughput and data quality, but they are still not ideal. To reduce impact of the aforementioned difficulties the experiments can be run with higher sequencing coverage (average number of reads containing a single base from a genome). It is also possible to mix reads from various instruments to exploit different advantages of them, e.g., short, but of good quality, Illumina reads can be used with long, intensely erroneous Pacific Biosciences reads. Alas, such approach often poses only a workaround of the problem, since it causes another difficulties. Sequencing cost, even though currently strongly decreased, is still considerable in limited scientific budgets. The onefold cost of sequencing is enlarged by sustained costs of storing and sharing huge amounts (even hundreds of GBs per single experiment) of a redundant data.

The problem of sequencing errors can be partially solved by involving specialized correction algorithms. In the recent years many of them were developed. The correctors detect potential errors and try to correct them, or sometimes reject strongly damaged parts of the data by truncating or removing the reads.

Yang et al. [22] classified the correction algorithms according to the approach of modeling the problem to: (i) $k$ -spectrum-based, (ii) suffix-tree/array-based, and (iii) multiple-sequence-alignment-based. The idea of the first category is to extract all fragments of reads of length $k$ ( $k$ -mers) and make use of data redundancy as the majority of $k$ -symbol-long fragments of the sequenced genome would be represented by a number of $k$ -mers in reads. The rare $k$ -mers are deemed as erroneous and altered in the reads to the most similar, but more frequent ones. The suffix-tree/array-based algorithms also extract read substrings, but stores them in suffix data structures, which allows to utilize different-length substrings simultaneously. The multiple-sequence-alignment-based algorithms select from the input files such reads, that seem to origin from the same genome fragments. Then they perform multiple sequence alignment to match them each other. It permits to find the consensus value of the particular read bases.

In [3] we proposed RECKONER, a $k$ -spectrum-based error correction algorithm. We performed comparison of it and the state-of-the-art algorithms: RACER [9], BLESS 2 [6], Blue [4], Musket [15], Lighter [13], Trowel [14], Pollux [16], BFC [12], Ace [9] from the group of $k$ -spectrum-based algorithms and Karect [1] from multiple-sequence-alignment-based algorithms.

In this paper we present significantly faster and giving better corrections version of RECKONER. Firstly, we reduced the time of processing of gzipped input files. Such files are typical in practice due to huge sizes of the input dataset. Secondly, we improved the method of automatic $k$ -mer length determination, which has positive influence on the quality of corrections if user will not define this parameter. Thirdly, we redesigned the parallelization scheme to make better use of modern multicore processors.

The rest of the paper is organized as follows. Section 2 presents the proposed algorithm. In Section 3 we show and discuss the results of the experiments. The last section concludes the paper.

2 Our algorithm

2.1 RECKONER workflow

The workflow of original RECKONER is shown in Fig. 1. Firstly, RECKONER traverses all reads present in the input files and distributes them to groups (chunks), that will be processed separately by different threads. The positions of the first reads in chunks are stored in a queue. Simultaneously, the minimum of read quality values is used for determination of the level of quality coding. If it is lower than $59$ (the smallest value of Phred+64 scale), it is supposed that the scale is Phred+33, and Phred+64 otherwise.

Then KMC [2], is used to determine the number of occurrences of all $k$ -mers ( $k$ -mer counts) and to build a database of $k$ -mers. The database is used to calculate a histogram of $k$ -mer counts and basing on this to determine the threshold of counts, which distinguish between probably erroneous and correct $k$ -mers. The erroneous $k$ -mers are removed from the database by KMC tools [10].

The main correction is performed by many threads maintained by OpenMP. Every thread picks a consecutive chunk from the queue. Then, it opens the input file and performs correction of the reads of the current chunk. As a result it saves to a temporary disk file description of introduced changes. For every chunk one file is created. The threads work until any chunk is available.

In the last stage the results are integrated. The input file is traversed once again. Now, according to the contents of consecutive temporary files, RECKONER introduces changes to the reads, which are then stored in the output file.

As we show in [3], RECKONER is one of the fastest read correction algorithms. The experiments were, however, performed on uncompressed data. As the stages of reads checking and results integrating are single-threaded, the time of performing them constitute a significant fraction of the complete processing. This effect is visible especially for compressed data, as in the last stage the output reads have to be compressed by a single thread. Moreover, the implemented method of chunks generation requires remembering of many positions in the input file. For gzipped input the time of seeking these positions in a file could be significant, especially when the files are large. Due to huge size of the input data, typically the only reasonable approach is to correct them without prior decompression, which enlarges the importance of addressing the problem.

The following subsections present the improvements introduced to RECKONER to remove its drawbacks.

2.2 Checking reads elimination

To eliminate the necessity of one input file traversal we decided to resign from the chunkifying stage. Instead, a short stage to determine the quality indicators level was added. Currently it is made as follows. A number of the first reads in the file are taken until difference between one of quality symbol value and the minimum value of Phred+33 ( $33$ ) or the maximum value of Phred+64 ( $104$ ) is lower than $5$ . The level containing the found value is chosen. Usually the number of reads necessary to check is only a few.

To distribute the reads between threads we utilized a solution similar to the one used in KMC. Before running the error correction stage an additional thread is created. It opens the input file and gets consecutive packages of data, fits them to contain only the entire reads and places them in buffers. The correcting threads take the buffers and correct the reads they contain.

2.3 Result integrating optimization

To optimize the last stage we decided to move the results integrating into the error correction stage. The idea behind that was to perform the output compression parallely. Currently, after performing correction of one chunk, the correcting thread stores the corrected reads in a (usually compressed) temporary file. The integration is performed in an additional thread, which concatenates the previously generated files. Correcting threads, after creating the temporary files, put information about the ordering of the processed chunk to a priority queue, which guarantees the original ordering of the chunks. Invariability of reads ordering in the output is required especially when processing paired-end reads.

2.4 Other changes

The original RECKONER has a simple algorithm for determination of $k$ -mer length, which is the most important parameter having a huge impact on quality of correction. The method is loosely based on logarithmic regression with parameters chosen empirically. Unfortunately, the parameters were determined on experiments with reads obtained with Mason [7], which generates reads properly, but their accompanying quality indicators are not adequate to the assumed quality profile of the reads. As RECKONER utilizes the indicators intensively, the results could be unreliable. At this moment for read generation we use Art [8], which accordingly to our observations generates the quality indicators properly.

Currently to compute the $k$ -mer length we propose the following empirical formula. It is based on results shown in Section 3:

[TABLE]

where $a$ and $b$ are logarithmic regression coefficients, $c$ is an adjustment for different read length, $\ell$ is mean read length, $g$ is genome size and $p$ is the average probability of error in %; $g$ and $p$ are calculated with the method shown in the supplementary material of [3]. Moreover, $k$ is lower- and upper-limited respectively by $k_{\min}=20$ and $k_{\max}=0.2\ell+30$ .

3 Experimental results

3.1 Data sets and algorithms

In [3] we presented the tests of correctors efficacy for maize genome, which is huge and highly repetitive. The conclusion was that correctors work poorly for such data, some of them introduce even more errors than properly correct. But as we said in Section 2.4, the employed read simulator (Mason) we used generated unreliable quality indicators. Therefore in the present article we decided to reexamine the correctors for maize reads obtained with Art [8], which generates the indicators in compliance with our expectations. Additionally we perform tests for human reads also obtained with Art.

We used our own profiles of errors basing of the real read sets. Details of the profiles are given in Table 1. For all datasets we set the sequencing coverage to $20$ . In [3] we ranked the correctors in terms of both quality and resource requirements, which ordered the algorithms starting from the best one as follows: RECKONER, BLESS, Blue, Karect, Musket, BFC, Lighter, RACER, Ace, Trowel, Pollux. In this article we selected RECKONER, BLESS, BFC, and Musket for comparison. We omitted Blue and Karect, as they were not able to complete in $12$ hours even for uncompressed data.

For RECKONER, BLESS, and Musket we selected $k$ -mer length by performing preliminary experiments for a few values with step 3. The exact parameters and command lines used for testing are shown in Appendix. RECKONER is implemented in the C++11 programming language. The experiments were run at workstation equipped with four AMD Opteron 6376 CPUs (16 cores each, clocked at 2.3 GHz) and 512 GB RAM running under openSUSE Leap 42.1 x86-64 OS. The algorithms were run to use 64 threads. For compilation we used GCC 4.9.2.

3.2 Correction time

We measured the time consumption of the correctors, including the optimized version of RECKONER. Figure 2 summarizes the results. For both maize and human data, the time reduction for RECKONER is about 10-fold over its former version. Finally, the correction time reached 2–3 hours, which is definitely acceptable for such huge organisms and is only a bit more than the fastest algorithm, BFC. The missing bar of Musket in Fig. 2(a) denotes that it was not able to correct the reads and returned untouched input file.

What is important, the results were obtained for compressed input. All correctors were able to read gzipped files, but only RECKONER and BLESS produced gzipped output files as well. BFC and Musket generated plain files. BLESS performs compression separately after the correction finishes. Thus, additional disk space (potentially huge) is necessary to store both uncompressed and compressed reads.

3.3 Correction efficacy

The corrector efficacy indicator, called gain is defined as follows. Let $\mathcal{TP}$ be a set of erroneous reads which were corrected perfectly, $\mathcal{FP}$ be a set of error-free reads, which were disrupted by a corrector, and $\mathcal{FN}$ be a set of erroneous reads, which were uncorrected at all or were miscorrected. Finally,

[TABLE]

Figure 3 presents the results of quality in terms of gain. As no changes to the correction procedure are introduced, for the user-defined $k$ -mer length both old and optimized RECKONER obtain the same results. They are better than for the competitors. The gains for maize data are worse than for human data, but they are approximately 2–3 times better than shown in [3].

3.4 MiSeq correction

In [3] we presented the results of correction only for HiSeq instruments. As Illumina produces also other instruments, i.e., MiSeq, we performed tests for it as well. Its important advantage are longer reads, in our case equal $250\text{bp}$ . As MiSeq instruments produce low-error-rate reads, the profile with error rate 0.3 % was selected. The computation times and quality results are shown in Fig. 4.

For these tests Musket also was unable to perform the correction. For all other correctors the obtained gains are above 95 and very close. Thus, we can conclude that the tested algorithms fit well also for longer reads. It is an important asset, as MiSeq reads could be used together with HiSeq reads to improve the quality of downstream algorithms by delivering longest reads.

4 Conclusions

In this paper we proposed new version of RECKONER, sequencing data correction algorithm. The main improvements over the previous version are much faster processing and better selection of the key parameter, $k$ -mer length. The proper choice of $k$ is important for all $k$ -spectrum-based correctors, as it has significant impact on the quality of correction. The automatic determination of $k$ is crucial in real-life applications as it is impossible to perform a number of corrections (for different values of $k$ ) to pick the best results. The new version of RECKONER appears to be about ten times faster than its predecessor for gzipped input files which makes it one of the fastest algorithms.

We showed that $k$ -spectrum-based algorithms are able to correct also reads of highly-repeated genomes like maize. Moreover, we showed, that such algorithms are able to correct also MiSeq reads.

Acknowledgement

The work was supported by the Polish National Science Center upon decision DEC-2015/17/B/ST6/01890.

Appendix

To generate reads we placed a real source profile in a directory <profile_dir> and run read Art with the following commands:

art_profiler_illumina <profile_name> <profile_dir>/ fastq

art_illumina -sam -1 <profile_name> -l <read_length> -i <genome> -c <read_number> -o <output> -rs 0 -na

We run correctors with the following commands with the parameters specified in Table 2:

reckoner -kmerlength <k> -prefix . <input_file>

bless -read <input_file> -kmerlength <k> -prefix tmp -gzip

bfc -s <genome_size> -t 64 <input_file> > <output_file>

musket -k <k> <kmers> -p 64 -o <output_file> <input_file>

Sizes of the correctors input data is presented in Table 3.

Bibliography23

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] Allam, A., Kalnis, P., and Solovyev, V.: Karect: Accurate Correction of Substitution, Insertion and Deletion Errors for Next-generation Sequencing Data. Bioinformatics 31(21), 3421–3428 (2015)
2[2] Deorowicz, S., et al.: KMC 2: Fast and resource-frugal k 𝑘 k -mer counting. Bioinformatics 31(10), 1569–1576 (2015)
3[3] Długosz, M., and Deorowicz, S.: RECKONER: read error corrector based on KMC. Bioinformatics, doi: 10.1093/bioinformatics/btw 746 (2017)
4[4] Greenfield, P., et al.: Blue: correcting sequencing errors using consensus and context. Bioinformatics 30(19), 2723–2732 (2014)
5[5] Haussler, D., et al.: Genome 10K: a proposal to obtain whole-genome sequence for 10 000 vertebrate species. Journal of Heredity 100.6, 659–674 (2009)
6[6] Heo, Y., et al.: BLESS 2: accurate, memory-efficient and fast error correction method. Bioinformatics 32(15), 2369–2371 (2016)
7[7] Holtgrewe, M.: Mason—a read simulator for second generation sequencing data. Technical report Freie Universität Berlin (2010)
8[8] Huang, W, et al.: ART: a next-generation sequencing read simulator. Bioinformatics 28(4), 593–594 (2012)