Analyzing Big Datasets of Genomic Sequences: Fast and Scalable Collection of k-mer Statistics
Umberto Ferraro Petrillo, Mara Sorella, Giuseppe Cattaneo, Raffaele Giancarlo, Simona Rombo

TL;DR
This paper introduces FastKmer, a scalable Spark-based method for efficient k-mer counting in large genomic datasets, emphasizing the importance of parameter tuning and workload balancing for optimal performance.
Contribution
The paper presents FastKmer, a novel distributed approach with workload balancing to improve efficiency and scalability in k-mer statistics extraction from large biological sequences.
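For readers unfamiliar with the task, k-mer counting in map-reduce style can be sketched as below. This is a minimal single-process illustration in plain Python, not FastKmer itself and not Spark code. The function names (`kmers`, `map_phase`, `reduce_phase`, `count_kmers`) and the toy reads are assumptions made for this example.

```python
from collections import Counter
from functools import reduce


def kmers(seq, k):
    """Yield every overlapping length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def map_phase(chunk, k):
    """Count k-mers locally within one chunk of reads (one 'mapper')."""
    local = Counter()
    for read in chunk:
        local.update(kmers(read, k))
    return local


def reduce_phase(a, b):
    """Merge two partial count tables into a new one (one 'reducer' step)."""
    merged = Counter(a)
    merged.update(b)
    return merged


def count_kmers(chunks, k):
    """Run the map phase on every chunk, then fold the partial tables."""
    partials = (map_phase(chunk, k) for chunk in chunks)
    return reduce(reduce_phase, partials, Counter())


# Toy input (hypothetical): two chunks of short reads.
chunks = [["ACGTAC", "CGTA"], ["GTACGT"]]
counts = count_kmers(chunks, 3)
print(sorted(counts.items()))
```

In a distributed setting, each chunk would be processed on a different worker and the merge would happen after a shuffle. The cost of that shuffle is one reason framework-specific engineering matters.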
Findings
FastKmer outperforms existing Big Data methods in speed.
Workload balancing significantly improves scalability.
Careful framework-specific engineering enhances analysis efficiency.
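To illustrate why workload balancing matters, the sketch below assigns k-mer bins of uneven size to a fixed number of partitions using a greedy largest-first heuristic. This summary does not describe FastKmer's actual balancing scheme, so the heuristic, the function name `balance_bins`, and the bin sizes are all assumptions chosen for illustration.

```python
import heapq


def balance_bins(bin_sizes, num_partitions):
    """Greedy largest-first assignment of bins to partitions.

    bin_sizes maps a bin id to its (estimated) number of k-mers.
    Returns (assignment, loads): bin id -> partition index, and the
    total load per partition.
    """
    # Min-heap of (current load, partition index).
    heap = [(0, p) for p in range(num_partitions)]
    heapq.heapify(heap)
    assignment = {}
    # Place the biggest bins first, each on the least-loaded partition.
    for bin_id, size in sorted(bin_sizes.items(), key=lambda kv: kv[1], reverse=True):
        load, p = heapq.heappop(heap)
        assignment[bin_id] = p
        heapq.heappush(heap, (load + size, p))
    loads = [0] * num_partitions
    for bin_id, p in assignment.items():
        loads[p] += bin_sizes[bin_id]
    return assignment, loads


# Hypothetical skewed bin sizes, as often arise with repetitive sequences.
bin_sizes = {"b0": 90, "b1": 10, "b2": 40, "b3": 35, "b4": 25}
assignment, loads = balance_bins(bin_sizes, 2)
print(assignment, loads)
```

A naive scheme that hashes bin ids to partitions could place the largest bins together, leaving one worker to finish long after the others. Balancing by estimated size avoids that straggler effect.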
Abstract
Distributed approaches based on the map-reduce programming paradigm have started to be proposed in the bioinformatics domain, due to the large amount of data produced by the next-generation sequencing techniques. However, the use of map-reduce and related Big Data technologies and frameworks (e.g., Apache Hadoop and Spark) does not necessarily produce satisfactory results, in terms of both efficiency and effectiveness. We discuss how the development of distributed and Big Data management technologies has affected the analysis of large datasets of biological sequences. Moreover, we show how the choice of different parameter configurations and the careful engineering of the software with respect to the specific framework under consideration may be crucial in order to achieve good performance, especially on very large amounts of data. We choose k-mers counting as a case study for our…
Taxonomy
Topics: Algorithms and Data Compression · Advanced Data Storage Technologies · Genomics and Phylogenetic Studies
