Lightweight LCP Construction for Next-Generation Sequencing Datasets
Markus J. Bauer, Anthony J. Cox, Giovanna Rosone, Marinella Sciortino

TL;DR
This paper introduces a lightweight, scalable method for constructing the LCP and BWT of large NGS datasets, enabling efficient analysis of hundreds of millions of DNA sequences without requiring large RAM data structures.
Contribution
The paper presents the first lightweight, sequential scan-based algorithm for computing LCP and BWT of massive sequence collections, suitable for human genome-scale data.
Findings
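Scales to collections as large as 800 million 100-mers
Computes the LCP and BWT via sequential scans, without holding large data structures in RAM
Supports rapid computation of maximal exact matches, shortest unique substrings and shortest absent words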
Abstract
The advent of "next-generation" DNA sequencing (NGS) technologies has meant that collections of hundreds of millions of DNA sequences are now commonplace in bioinformatics. Knowing the longest common prefix array (LCP) of such a collection would facilitate the rapid computation of maximal exact matches, shortest unique substrings and shortest absent words. CPU-efficient algorithms for computing the LCP of a string have been described in the literature, but require the presence in RAM of large data structures. This prevents such methods from being feasible for NGS datasets. In this paper we propose the first lightweight method that simultaneously computes, via sequential scans, the LCP and BWT of very large collections of sequences. Computational results on collections as large as 800 million 100-mers demonstrate that our algorithm scales to the vast sequence collections encountered in…
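To make the two structures named in the abstract concrete, here is a minimal in-memory sketch of what the BWT and LCP array of a small sequence collection look like. It is not the paper's lightweight algorithm: it sorts every suffix in RAM, which is exactly what does not scale to NGS datasets. The function name `naive_bwt_lcp` is hypothetical, and the convention that each sequence ends in its own end marker (ordered by sequence index and smaller than any nucleotide) is an assumption of this illustration.

```python
def naive_bwt_lcp(seqs):
    """Naive BWT and LCP array of a collection of sequences.

    Assumption: sequence i is terminated by a distinct end marker $_i,
    with $_0 < $_1 < ... < any nucleotide. All markers are printed as "$".
    """
    # Enumerate every suffix, including the empty one (the end marker alone).
    suffixes = []
    for i, s in enumerate(seqs):
        for k in range(len(s) + 1):
            suffixes.append((s[k:], i, k))

    # Python orders a proper prefix before the longer string, which matches
    # "end marker < any letter"; ties are broken by sequence index.
    suffixes.sort(key=lambda t: (t[0], t[1]))

    bwt = []
    lcp = []
    prev = None
    for text, i, k in suffixes:
        # BWT symbol: the character preceding the suffix, or the end
        # marker when the suffix is the whole sequence.
        bwt.append(seqs[i][k - 1] if k > 0 else "$")

        # LCP value: length of the common prefix with the previous suffix
        # in sorted order. End markers are distinct, so they never match.
        if prev is None:
            lcp.append(0)
        else:
            n = 0
            while n < min(len(prev), len(text)) and prev[n] == text[n]:
                n += 1
            lcp.append(n)
        prev = text

    return "".join(bwt), lcp
```

For the two-read collection `["AC", "AG"]`, the sorted suffixes are `$`, `$`, `AC$`, `AG$`, `C$`, `G$`, so the call returns the BWT `"CG$$AA"` and the LCP array `[0, 0, 0, 1, 0, 0]`. The single nonzero entry records that `AC$` and `AG$` share the prefix `A`.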
