SPRISS: Approximating Frequent $k$-mers by Sampling Reads, and Applications
Diego Santoro, Leonardo Pellegrina, Fabio Vandin

TL;DR
SPRISS is an efficient sampling-based algorithm that approximates frequent $k$-mers in large sequencing datasets, significantly reducing computational time while maintaining accuracy for various genomic analyses.
Contribution
It introduces a simple sampling scheme combined with existing $k$-mer counting methods to efficiently approximate frequent $k$-mers in large datasets.
Findings
SPRISS achieves high accuracy in approximating frequent $k$-mers.
It reduces analysis time by a significant margin.
It is effective in metagenomic dataset comparison and in identifying discriminative $k$-mers.
Abstract
The extraction of $k$-mers is a fundamental component in many complex analyses of large next-generation sequencing datasets, including reads classification in genomics and the characterization of RNA-seq datasets. The extraction of all $k$-mers and their frequencies is extremely demanding in terms of running time and memory, owing to the size of the data and to the exponential number of $k$-mers to be considered. However, in several applications, only frequent $k$-mers, which are $k$-mers appearing in a relatively high proportion of the data, are required by the analysis. In this work we present SPRISS, a new efficient algorithm to approximate frequent $k$-mers and their frequencies in next-generation sequencing data. SPRISS employs a simple yet powerful reads sampling scheme, which makes it possible to extract a representative subset of the dataset that can be used, in combination with any…
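To make the idea in the abstract concrete, the following is a minimal Python sketch of the general pattern: sample reads, count $k$-mers in the sample only, and report those whose estimated frequency reaches a threshold. It is not the SPRISS algorithm itself. The function names, the `theta` threshold, the `sample_size` parameter, sampling with replacement, and the definition of frequency as occurrences divided by the total number of $k$-mer positions are all assumptions made for illustration. The excerpt above does not specify the actual sampling scheme, how the sample size is chosen, or any accuracy guarantees.

```python
import random
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence in a list of reads (exact, naive counter)."""
    counts = Counter()
    for read in reads:
        # A read shorter than k contributes no k-mers (the range is empty).
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def approx_frequent_kmers(reads, k, theta, sample_size, seed=0):
    """Estimate frequent k-mers from a random sample of reads.

    Hypothetical illustration only: reads are drawn uniformly with
    replacement, and a k-mer is reported when its frequency in the sample
    (occurrences / total k-mer positions) is at least `theta`.
    """
    rng = random.Random(seed)
    sample = [rng.choice(reads) for _ in range(sample_size)]
    counts = count_kmers(sample, k)  # any exact k-mer counter could be used here
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        kmer: c / total
        for kmer, c in counts.items()
        if c / total >= theta
    }


if __name__ == "__main__":
    toy_reads = ["ACGTACGTAC", "ACGTTTTTTT", "GGGGACGTGG", "ACGTACGGGG"]
    estimates = approx_frequent_kmers(toy_reads, k=4, theta=0.1, sample_size=3)
    for kmer, freq in sorted(estimates.items()):
        print(kmer, round(freq, 3))
```

Because the counting step only sees the sample, its cost scales with `sample_size` rather than with the full dataset. The estimates are approximate, however, and $k$-mers whose true frequency is close to `theta` may be missed or reported spuriously in a sketch like this one.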
Taxonomy
Topics: Algorithms and Data Compression · Genomics and Phylogenetic Studies · Machine Learning and Algorithms
