The colored longest common prefix array computed via sequential scans
F. Garofalo, G. Rosone, M. Sciortino, D. Verzotto

TL;DR
This paper introduces the colored longest common prefix array (cLCP), a new data structure that enables efficient, alignment-free comparison of large biological sequence datasets through sequential scans in semi-external memory.
Contribution
The paper presents the cLCP data structure, shows that it can be computed via sequential scans in semi-external memory, and uses it to solve the multi-string Average Common Substring (ACS) problem.
Findings
cLCP can be computed via sequential scans in semi-external memory
The approach effectively solves the multi-string Average Common Substring problem
Experimental results confirm the method's efficiency and practicality
Abstract
Due to the increased availability of large datasets of biological sequences, tools for sequence comparison now rely to a greater extent on efficient alignment-free approaches. Most alignment-free approaches require computing statistics of the sequences in the dataset. Such computations become impractical in internal memory when very large collections of long sequences are considered. In this paper, we present a new conceptual data structure, the colored longest common prefix array (cLCP), that makes it possible to tackle several problems efficiently with an alignment-free approach. In fact, we show that such a data structure can be computed via sequential scans in semi-external memory. By using cLCP, we propose an efficient lightweight strategy to solve the multi-string Average Common Substring (ACS) problem, which consists in the pairwise comparison of a single string…
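To make the ACS statistic mentioned in the abstract concrete, here is a minimal brute-force sketch for one pair of strings. It is not the paper's cLCP-based method: it works entirely in internal memory and takes time polynomial in the string lengths, so it is useful only as an illustration on tiny inputs. The function names `match_length` and `acs_score` are hypothetical. The definition assumed here is the usual one: for each position i of a string a, take the length of the longest substring starting at i that also occurs somewhere in b, then average over all positions.

```python
def match_length(a: str, i: int, b: str) -> int:
    """Length of the longest prefix of a[i:] that occurs somewhere in b."""
    length = 0
    # Grow the candidate substring one character at a time
    # for as long as it still occurs in b.
    while i + length < len(a) and a[i:i + length + 1] in b:
        length += 1
    return length


def acs_score(a: str, b: str) -> float:
    """Average of match_length(a, i, b) over all positions i of a.

    Brute-force illustration only; assumes the standard ACS definition,
    which the truncated abstract does not spell out.
    """
    if not a:
        raise ValueError("a must be non-empty")
    return sum(match_length(a, i, b) for i in range(len(a))) / len(a)


if __name__ == "__main__":
    # Higher scores mean the two strings share longer substrings.
    print(acs_score("ACGTACGT", "TTACGTAA"))
    print(acs_score("ACGTACGT", "GGGGCCCC"))
```

The score is not symmetric: `acs_score(a, b)` and `acs_score(b, a)` generally differ, so a pairwise comparison usually needs both directions. Turning the scores into a distance requires a further normalization step that is not shown here.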
Taxonomy
Topics: Algorithms and Data Compression · Genomics and Phylogenetic Studies · DNA and Biological Computing
