SPRISS: Approximating Frequent $k$-mers by Sampling Reads, and Applications

Diego Santoro; Leonardo Pellegrina; Fabio Vandin

arXiv:2101.07117·q-bio.QM·January 19, 2021·Bioinform.

TL;DR

SPRISS is an efficient sampling-based algorithm that approximates frequent $k$-mers in large sequencing datasets, significantly reducing computational time while maintaining accuracy for various genomic analyses.

Contribution

SPRISS combines a simple read-sampling scheme with existing $k$-mer counting methods to efficiently approximate frequent $k$-mers in large datasets.

Findings

01 SPRISS approximates frequent $k$-mers and their frequencies with high accuracy.

02 SPRISS significantly reduces analysis time by working on a sample of the reads instead of the full dataset.

03 SPRISS is effective for comparing metagenomic datasets and for identifying discriminative $k$-mers.

Abstract

The extraction of $k$-mers is a fundamental component in many complex analyses of large next-generation sequencing datasets, including read classification in genomics and the characterization of RNA-seq datasets. Extracting all $k$-mers and their frequencies is extremely demanding in terms of running time and memory, owing to the size of the data and to the exponential number of $k$-mers to be considered. However, several applications require only the frequent $k$-mers, that is, the $k$-mers appearing in a relatively high proportion of the data. In this work we present SPRISS, a new efficient algorithm to approximate frequent $k$-mers and their frequencies in next-generation sequencing data. SPRISS employs a simple yet powerful read-sampling scheme, which makes it possible to extract a representative subset of the dataset that can be used, in combination with any…
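
The general idea described in the abstract (sample reads, then run an ordinary $k$-mer counter on the sample) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the uniform sampling with replacement, the fixed `sample_size`, and the definition of frequency as a $k$-mer's share of all $k$-mer occurrences are assumptions made here for illustration, and the sketch does not reproduce the sample-size choice or the approximation guarantees of SPRISS.

```python
import random
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence in a list of reads (plain strings)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def frequent_kmers(reads, k, theta):
    """Return {k-mer: frequency} for k-mers with frequency >= theta.

    Assumption: frequency = occurrences of the k-mer divided by the
    total number of k-mer occurrences in the given reads.
    """
    counts = count_kmers(reads, k)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        kmer: c / total
        for kmer, c in counts.items()
        if c / total >= theta
    }


def approx_frequent_kmers(reads, k, theta, sample_size, seed=0):
    """Estimate frequent k-mers from a random sample of the reads.

    Assumption: reads are drawn uniformly with replacement; the counting
    step is unchanged and simply runs on the smaller sample.
    """
    rng = random.Random(seed)
    sample = rng.choices(reads, k=sample_size)
    return frequent_kmers(sample, k, theta)


if __name__ == "__main__":
    # Hypothetical toy dataset; real inputs would be parsed from FASTQ files.
    reads = ["ACGTACGT"] * 50 + ["TTTTTTTT"] * 50
    print(frequent_kmers(reads, k=4, theta=0.15))
    print(approx_frequent_kmers(reads, k=4, theta=0.15, sample_size=40))
```

Because counting is a separate function, it could in principle be swapped for any existing $k$-mer counter, which matches the abstract's point that the sample is meant to be used together with other counting tools.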

Taxonomy

Topics: Algorithms and Data Compression · Genomics and Phylogenetic Studies · Machine Learning and Algorithms