Lightweight LCP Construction for Very Large Collections of Strings
Anthony J. Cox, Fabio Garofalo, Giovanna Rosone, Marinella Sciortino

TL;DR
This paper introduces extLCP, a lightweight, disk-based algorithm that computes the longest common prefix (LCP) array and the Burrows-Wheeler transform (BWT) of a large collection of strings, and can also produce the suffix array. It targets applications such as biological data analysis.
Contribution
The paper presents the first lightweight, disk-based algorithm that simultaneously computes the LCP array and the BWT of a very large collection of strings of any length.
Findings
extLCP accesses data on disk only through sequential scans, which keeps main-memory usage low.
Total disk usage never exceeds twice the output size, not counting the space occupied by the input.
Experimental results show competitive performance on real biological data.
Abstract
The longest common prefix array is a very advantageous data structure that, combined with the suffix array and the Burrows-Wheeler transform, makes it possible to efficiently compute combinatorial properties of a string that are useful in several applications, especially in biological contexts. Nowadays, the input data for many problems are large collections of strings, for instance the data produced by "next-generation" DNA sequencing (NGS) technologies. In this paper we present the first lightweight algorithm (called extLCP) for the simultaneous computation of the longest common prefix array and the Burrows-Wheeler transform of a very large collection of strings of any length. The computation performs disk data accesses only via sequential scans, and the total disk space used never exceeds twice the output size, excluding the disk space required for the input. Moreover,…
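To make the two outputs named in the abstract concrete, here is a minimal in-memory sketch that builds the BWT and the LCP array of a small string collection by brute-force suffix sorting. It is not extLCP: it holds every suffix in RAM and never touches disk, so it only illustrates what the arrays contain. The function name `naive_bwt_lcp` is hypothetical, and the end-marker convention (string i ends with its own marker $_i, ordered by i and smaller than every ordinary symbol) is an assumption made for this example.

```python
from typing import List, Tuple


def naive_bwt_lcp(strings: List[str]) -> Tuple[str, List[int]]:
    """Brute-force BWT and LCP array of a string collection (illustration only).

    Assumption: string i is terminated by a distinct end-marker $_i, with
    $_i < $_j for i < j, and every end-marker smaller than any ordinary symbol.
    Input strings are assumed not to contain the character '$'.
    """
    suffixes = []
    for i, s in enumerate(strings):
        # Encode symbols as tuples: (0, i) is the end-marker $_i and (1, c) is
        # the ordinary symbol c, so end-markers sort first and stay distinct.
        encoded = [(1, c) for c in s] + [(0, i)]
        for start in range(len(encoded)):
            # The BWT symbol is the one preceding the suffix in its own string;
            # the suffix that spans the whole string is preceded by its marker.
            prev = "$" if start == 0 else s[start - 1]
            suffixes.append((encoded[start:], prev))

    # Sort all suffixes of all strings lexicographically.
    suffixes.sort(key=lambda item: item[0])

    bwt = "".join(prev for _, prev in suffixes)

    # lcp[k] is the length of the common prefix of sorted suffixes k-1 and k;
    # lcp[0] is 0 by convention.
    lcp = [0] * len(suffixes)
    for k in range(1, len(suffixes)):
        a, b = suffixes[k - 1][0], suffixes[k][0]
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        lcp[k] = n
    return bwt, lcp


if __name__ == "__main__":
    bwt, lcp = naive_bwt_lcp(["AC", "AG"])
    print(bwt)  # CG$$AA
    print(lcp)  # [0, 0, 0, 1, 0, 0]
```

For the collection {AC, AG}, the sorted suffixes are $_0, $_1, AC$_0, AG$_1, C$_0, G$_1. The only adjacent pair sharing a prefix is AC$_0 and AG$_1, which gives the single 1 in the LCP array. Sorting suffixes explicitly like this needs memory proportional to the sum of all suffix lengths, which is the kind of cost a disk-based, sequential-scan approach is meant to avoid.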
