Lightweight LCP Construction for Very Large Collections of Strings

Anthony J. Cox; Fabio Garofalo; Giovanna Rosone; Marinella Sciortino

arXiv:1605.04098·cs.DS·May 16, 2016

TL;DR

This paper introduces extLCP, a lightweight, disk-based algorithm that computes the longest common prefix (LCP) array, the Burrows-Wheeler transform (BWT), and the suffix array of very large string collections, such as those produced by DNA sequencing.

Contribution

The paper presents the first lightweight, disk-based algorithm for simultaneously computing the LCP array and the BWT of a very large collection of strings, where the strings may have any length.

Findings

01

extLCP accesses data on disk only via sequential scans, in keeping with its lightweight design.

02

Total disk usage never exceeds twice the output size, excluding the disk space required for the input.

03

Experimental results show competitive performance on real biological data.

Abstract

The longest common prefix array is a very advantageous data structure that, combined with the suffix array and the Burrows-Wheeler transform, makes it possible to efficiently compute some combinatorial properties of a string that are useful in several applications, especially in biological contexts. Nowadays, the input data for many problems are big collections of strings, for instance the data coming from "next-generation" DNA sequencing (NGS) technologies. In this paper we present the first lightweight algorithm (called extLCP) for the simultaneous computation of the longest common prefix array and the Burrows-Wheeler transform of a very large collection of strings of any length. The computation is carried out by accessing disk data only via sequential scans, and the total disk space usage never exceeds twice the output size, excluding the disk space required for the input. Moreover,…
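
To make the two outputs named in the abstract concrete, the following naive in-memory sketch builds the BWT and the LCP array of a small collection by explicitly sorting every suffix. It is not extLCP: it uses no disk and no sequential scans, and is only practical for tiny inputs. The function name `naive_bwt_lcp`, the encoding of end-markers as tuples, the printed `$` symbol, and the convention that the first LCP entry is 0 are all assumptions made for illustration.

```python
def naive_bwt_lcp(strings):
    """Return (bwt, lcp) for a collection by sorting all suffixes in memory.

    Illustrative only: quadratic space, nothing like the disk-based extLCP.
    """
    suffixes = []
    for i, s in enumerate(strings):
        # Real symbols become (1, ch); the end-marker of string i becomes (0, i),
        # so end-markers are pairwise distinct and smaller than every real symbol.
        enc = [(1, ch) for ch in s] + [(0, i)]
        for j in range(len(enc)):
            # The BWT symbol is the one preceding the suffix in its own string;
            # the full string (j == 0) is preceded, circularly, by its end-marker.
            prev = s[j - 1] if j > 0 else "$"
            suffixes.append((enc[j:], prev))

    # Lexicographic order of the encoded suffixes.
    suffixes.sort(key=lambda pair: pair[0])

    bwt = "".join(prev for _, prev in suffixes)

    # lcp[k] = length of the longest common prefix of suffixes k-1 and k.
    lcp = [0]
    for (a, _), (b, _) in zip(suffixes, suffixes[1:]):
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        lcp.append(n)
    return bwt, lcp


if __name__ == "__main__":
    bwt, lcp = naive_bwt_lcp(["GA", "CA"])
    print(bwt)  # AAGC$$
    print(lcp)  # [0, 0, 0, 1, 0, 0]
```

Because the end-markers of different strings are distinct, a common prefix never extends across the end of a string: in the example, the suffixes `A$` of the two strings share only the symbol `A`, giving the single LCP value of 1.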
