Fast Label Extraction in the CDAWG

Djamal Belazzougui; Fabio Cunial

arXiv:1707.08197·cs.DS·September 27, 2017

TL;DR

This paper presents faster algorithms for label extraction and pattern matching in the CDAWG, a compact index for highly repetitive datasets such as collections of genomes from similar species.

Contribution

It removes the $\log \log n$ factor from the time needed to count pattern occurrences, locate them, and read edge labels, making each bound linear in the size of the input and output. The improvements leverage straight-line programs for label extraction.
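To make the idea of reading a label from a straight-line program concrete, here is a minimal sketch, not the paper's construction. It assumes a hypothetical grammar stored as a dictionary in which every rule is either a single character or a pair of rule names, and it extracts the first `k` characters by a left-to-right traversal. This naive traversal can visit more than `k` grammar symbols; avoiding that overhead is exactly what a faster extraction method has to achieve.

```python
from typing import Dict, List, Tuple, Union

# A rule is either one terminal character or a pair of rule names.
# This encoding is an assumption made for illustration only.
Rule = Union[str, Tuple[str, str]]


def extract_prefix(rules: Dict[str, Rule], start: str, k: int) -> str:
    """Return the first k characters of the string generated by `start`."""
    out: List[str] = []
    stack = [start]  # symbols still to expand, leftmost on top
    while stack and len(out) < k:
        rule = rules[stack.pop()]
        if isinstance(rule, str):
            out.append(rule)      # terminal: emit one character
        else:
            left, right = rule
            stack.append(right)   # expand the right part later
            stack.append(left)    # expand the left part first
    return "".join(out)


# Hypothetical grammar whose start symbol X4 generates "abaababa".
rules: Dict[str, Rule] = {
    "A": "a",
    "B": "b",
    "X1": ("A", "B"),    # ab
    "X2": ("X1", "A"),   # aba
    "X3": ("X2", "X1"),  # abaab
    "X4": ("X3", "X2"),  # abaababa
}

print(extract_prefix(rules, "X4", 5))  # abaab
```

The explicit stack avoids recursion limits on deep grammars and stops as soon as `k` characters have been emitted, so the full string is never materialized.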

Findings

01

Pattern counting time reduced from $O(m \log \log n)$ to $O(m)$

02

Occurrence locating time reduced from $O(m \log \log n + occ)$ to $O(m + occ)$

03

Time to read the $k$ characters of a suffix tree edge label reduced from $O(k \log \log n)$ to $O(k)$

Abstract

The compact directed acyclic word graph (CDAWG) of a string $T$ of length $n$ takes space proportional just to the number $e$ of right extensions of the maximal repeats of $T$, and it is thus an appealing index for highly repetitive datasets, like collections of genomes from similar species, in which $e$ grows significantly more slowly than $n$. We reduce from $O(m \log \log n)$ to $O(m)$ the time needed to count the number of occurrences of a pattern of length $m$, using an existing data structure that takes an amount of space proportional to the size of the CDAWG. This implies a reduction from $O(m \log \log n + occ)$ to $O(m + occ)$ in the time needed to locate all the $occ$ occurrences of the pattern. We also reduce from $O(k \log \log n)$ to $O(k)$ the time needed to read the $k$ characters of the label of an edge of the suffix tree of $T$, and we reduce…
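As a rough illustration of the quantity $e$ defined above, the following brute-force sketch counts the right extensions of the maximal repeats of a short string. It rests on simplifying assumptions that may differ from the paper's exact conventions: the text is padded with two distinct sentinel characters, the empty string is allowed to count as a maximal repeat, and a substring is treated as a maximal repeat when it is preceded by at least two distinct characters and followed by at least two distinct characters. The function name `count_right_extensions` is hypothetical, and the cubic running time makes it suitable only for toy inputs.

```python
from collections import defaultdict
from typing import DefaultDict, Set


def count_right_extensions(text: str) -> int:
    """Brute-force count of right extensions of maximal repeats (toy sizes only)."""
    # Assumed convention: distinct start and end sentinels absent from the text.
    s = "\x00" + text + "\x01"
    n = len(s)
    left: DefaultDict[str, Set[str]] = defaultdict(set)
    right: DefaultDict[str, Set[str]] = defaultdict(set)
    # Enumerate every substring s[i:j] of the text, including the empty one,
    # and record the characters seen immediately before and after it.
    for i in range(1, n):
        for j in range(i, n):
            w = s[i:j]
            left[w].add(s[i - 1])
            right[w].add(s[j])
    # Sum the distinct right extensions over substrings that are both
    # left-maximal and right-maximal under the convention above.
    return sum(
        len(right[w])
        for w in right
        if len(left[w]) >= 2 and len(right[w]) >= 2
    )


print(count_right_extensions("abab"))  # 5
```

For `"abab"` the maximal repeats under these assumptions are the empty string (three right extensions, counting the end sentinel) and `"ab"` (two), which gives 5.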
