An Improved Filtering Algorithm for Big Read Datasets

Axel Wedemeyer; Lasse Kliemann; Anand Srivastav; Christian Schielke; Thorsten B. Reusch; Philip Rosenstiel

arXiv:1610.03443·q-bio.GN·October 12, 2016

TL;DR

Bignorm is a read filtering algorithm that uses phred quality scores together with $k$-mer counts to decide which reads to keep. It greatly reduces dataset size and speeds up genome assembly without compromising assembly quality.

Contribution

We introduce Bignorm, a quality-aware read filtering algorithm that runs faster than Diginorm while maintaining high assembly quality.

Findings

1. Removes a median of 97.15% of reads with recommended parameters

2. Produces high-quality assemblies faster

3. Maintains competitive assembly quality

Abstract

For single-cell or metagenomic sequencing projects, it is necessary to sequence with a very high mean coverage to make sure that all parts of the sample DNA are covered by the reads produced. This leads to huge datasets with a lot of redundant data, so filtering the data prior to assembly is advisable. Titus Brown et al. (2012) presented the algorithm Diginorm for this purpose, which filters reads based on the abundance of their $k$-mers. We present Bignorm, a faster and quality-conscious read filtering algorithm. An important new feature is the use of phred quality scores together with a detailed analysis of the $k$-mer counts to decide which reads to keep. With recommended parameters, we remove a median of 97.15% of the reads while keeping the mean phred score of the filtered dataset high. Using the SPAdes assembler, we produce assemblies of high quality from these…
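
To make the idea of abundance-based filtering concrete, here is a minimal Python sketch. It is not the Bignorm or Diginorm implementation. It assumes the commonly described median rule (keep a read only if the median count of its $k$-mers seen so far is below a cutoff) and adds one hypothetical quality rule: a $k$-mer is counted only if every base in it has a phred score of at least `min_phred`. The function name `filter_reads`, the parameters `k`, `cutoff`, and `min_phred`, and the exact in-memory counter are all assumptions made for illustration; real tools use compact approximate counting structures and more detailed decision rules.

```python
from collections import Counter
from statistics import median


def filter_reads(reads, k=4, cutoff=2, min_phred=20):
    """Illustrative read filter (hypothetical, not the published algorithm).

    reads: iterable of (sequence, phred_scores) pairs, where phred_scores
           is a list of integers with the same length as the sequence.
    Returns the list of reads that were kept, in input order.
    """
    counts = Counter()  # exact k-mer counts; an assumption for simplicity
    kept = []
    for seq, quals in reads:
        # Assumed quality rule: trust a k-mer only if all of its bases
        # reach the phred threshold.
        trusted = [
            seq[i:i + k]
            for i in range(len(seq) - k + 1)
            if min(quals[i:i + k]) >= min_phred
        ]
        if not trusted:
            # No trustworthy k-mers at all: drop the read.
            continue
        # Median rule: keep the read only while its k-mers are still rare.
        if median(counts[km] for km in trusted) < cutoff:
            kept.append((seq, quals))
            counts.update(trusted)
    return kept
```

With five identical high-quality copies of a read and `cutoff=2`, only the first two copies pass the median test; later copies are treated as redundant. A read whose bases all fall below `min_phred` contributes no trusted $k$-mers and is discarded under this sketch's assumed rule.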

Taxonomy

Topics: Algorithms and Data Compression · Genomics and Phylogenetic Studies · Advanced Data Storage Technologies