Lightweight LCP Construction for Next-Generation Sequencing Datasets

Markus J. Bauer; Anthony J. Cox; Giovanna Rosone; Marinella Sciortino

arXiv:1305.0160 · cs.DS · May 2, 2013

TL;DR

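This paper introduces a lightweight, scalable method for constructing the longest common prefix array (LCP) and Burrows-Wheeler transform (BWT) of large next-generation sequencing (NGS) datasets. It works through sequential scans, so collections of hundreds of millions of DNA sequences can be processed without holding large data structures in RAM.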

Contribution

The paper presents the first lightweight, sequential scan-based algorithm for computing LCP and BWT of massive sequence collections, suitable for human genome-scale data.

Findings

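01. Scales to collections as large as 800 million 100-mers

02. Computes the LCP and BWT via sequential scans, without large data structures in RAM

03. Supports rapid computation of maximal exact matches, shortest unique substrings and shortest absent words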

Abstract

The advent of "next-generation" DNA sequencing (NGS) technologies has meant that collections of hundreds of millions of DNA sequences are now commonplace in bioinformatics. Knowing the longest common prefix array (LCP) of such a collection would facilitate the rapid computation of maximal exact matches, shortest unique substrings and shortest absent words. CPU-efficient algorithms for computing the LCP of a string have been described in the literature, but require the presence in RAM of large data structures. This prevents such methods from being feasible for NGS datasets. In this paper we propose the first lightweight method that simultaneously computes, via sequential scans, the LCP and BWT of very large collections of sequences. Computational results on collections as large as 800 million 100-mers demonstrate that our algorithm scales to the vast sequence collections encountered in…
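To make the two outputs named in the abstract concrete, here is a minimal sketch that builds the BWT and LCP of a tiny collection by sorting every suffix in memory. This is the naive approach, not the paper's lightweight sequential-scan method, and it would not scale to NGS data. The function name `naive_bwt_lcp` is hypothetical, and the conventions are assumptions made for illustration: each sequence gets its own end marker (printed as `$`) that sorts before every nucleotide, identical suffixes are ordered by sequence index, and LCP values do not count end markers.

```python
def naive_bwt_lcp(seqs):
    """Naive BWT and LCP of a collection (illustration only).

    Assumes sequences are uppercase DNA strings that do not contain "$".
    """
    # Collect every suffix of every sequence, including the empty one.
    # "$" sorts before A, C, G, T; ties between identical suffixes are
    # broken by the sequence index i, mimicking distinct end markers.
    suffixes = []
    for i, s in enumerate(seqs):
        for j in range(len(s) + 1):
            suffixes.append((s[j:] + "$", i, j))
    suffixes.sort()

    # BWT symbol: the character preceding each suffix in its own sequence,
    # or the end marker when the suffix is the whole sequence.
    bwt = "".join(seqs[i][j - 1] if j > 0 else "$" for _, i, j in suffixes)

    # LCP[k]: length of the common prefix of suffix k and suffix k-1 in
    # sorted order. End markers are treated as distinct, so they are
    # stripped before comparing.
    lcp = [0]
    for (a, _, _), (b, _, _) in zip(suffixes, suffixes[1:]):
        a, b = a[:-1], b[:-1]
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        lcp.append(n)
    return bwt, lcp


bwt, lcp = naive_bwt_lcp(["ACG", "CG"])
print(bwt)  # GG$A$CC
print(lcp)  # [0, 0, 0, 0, 2, 0, 1]
```

The sort materialises every suffix, so memory grows quadratically with read length times the number of reads. That is the kind of RAM cost the abstract says rules out in-memory methods for collections of hundreds of millions of sequences.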
