An Improved Filtering Algorithm for Big Read Datasets
Axel Wedemeyer, Lasse Kliemann, Anand Srivastav, Christian Schielke, Thorsten B. Reusch, Philip Rosenstiel

TL;DR
Bignorm is a new read filtering algorithm that improves speed and quality by incorporating quality scores, significantly reducing dataset size and accelerating genome assembly without compromising assembly quality.
Contribution
We introduce Bignorm, a faster, quality-aware filtering algorithm that outperforms Diginorm in speed while maintaining high assembly quality.
Findings
Removes a median of 97.15% of reads
Produces high-quality assemblies faster
Maintains competitive assembly quality
Abstract
For single-cell or metagenomic sequencing projects, it is necessary to sequence with a very high mean coverage to make sure that all parts of the sample DNA are covered by the reads produced. This leads to huge datasets with a lot of redundant data, so filtering the data prior to assembly is advisable. Titus Brown et al. (2012) presented the algorithm Diginorm for this purpose, which filters reads based on the abundance of their k-mers. We present Bignorm, a faster and quality-conscious read filtering algorithm. An important new feature is the use of phred quality scores together with a detailed analysis of the k-mer counts to decide which reads to keep. With recommended parameters, we remove a median of 97.15% of the reads while keeping the mean phred score of the filtered dataset high. Using the SPAdes assembler, we produce assemblies of high quality from these…
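The idea of filtering reads by k-mer abundance while taking phred scores into account can be illustrated with a minimal Python sketch. This is not Bignorm's actual decision rule, which the abstract does not spell out; it is a simplified Diginorm-style median filter with an assumed quality cutoff. The function names, the parameters `k`, `max_count`, and `min_phred`, and the use of an exact `Counter` in place of a memory-efficient sketch structure are all assumptions made for illustration.

```python
from collections import Counter
from statistics import median


def quality_kmers(seq, quals, k, min_phred):
    """Return the k-mers of seq whose bases all have phred >= min_phred.

    Assumption for illustration: low-quality k-mers are simply ignored.
    """
    return [
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if min(quals[i:i + k]) >= min_phred
    ]


def filter_reads(reads, k=4, max_count=2, min_phred=20):
    """Keep a read only if its trusted k-mers are not yet abundant.

    reads: iterable of (sequence, list_of_phred_scores) pairs.
    A read is kept when the median count of its quality-passing k-mers
    is below max_count; its k-mers are then added to the counts.
    """
    counts = Counter()  # exact counts; a real tool would use a compact sketch
    kept = []
    for seq, quals in reads:
        good = quality_kmers(seq, quals, k, min_phred)
        if not good:
            continue  # no trustworthy k-mers: drop the read
        if median(counts[km] for km in good) < max_count:
            kept.append((seq, quals))
            counts.update(good)
    return kept


# Five identical high-quality reads plus one low-quality read.
reads = [("ACGTACGT", [30] * 8)] * 5 + [("TTTTGGGG", [5] * 8)]
filtered = filter_reads(reads)
```

In this toy input the redundant copies are discarded once their k-mers have been seen often enough, and the low-quality read is dropped because none of its k-mers pass the assumed phred cutoff.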
Taxonomy
Topics: Algorithms and Data Compression · Genomics and Phylogenetic Studies · Advanced Data Storage Technologies
