Fast Label Extraction in the CDAWG
Djamal Belazzougui, Fabio Cunial

TL;DR
This paper presents optimized algorithms for label extraction and pattern matching in the CDAWG, significantly improving efficiency for highly repetitive datasets like genomic collections.
Contribution
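It removes the extra logarithmic-type factors from the time needed to count pattern occurrences, locate them, and read edge labels, so that each becomes linear in the pattern length, the output size, or the label length respectively. The improvement comes from extracting labels with a straight-line program.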
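To make the idea of label extraction from a straight-line program concrete, here is a minimal sketch that reads the first k characters of the string derived by a grammar symbol without expanding the rest. Everything in it is an illustrative assumption, not the paper's construction: the rule encoding, the names `RULES` and `read_prefix`, and the toy Fibonacci-style grammar are hypothetical, and this simple traversal runs in O(k + h) time for a grammar of height h rather than achieving the O(k) bound reported in the paper.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical encoding: a rule is either a single terminal character
# or a pair of nonterminal names (left, right).
Rule = Union[str, Tuple[str, str]]

# Toy grammar (an assumption for illustration): X6 derives "abaababa".
RULES: Dict[str, Rule] = {
    "X1": "b",
    "X2": "a",
    "X3": ("X2", "X1"),  # "ab"
    "X4": ("X3", "X2"),  # "aba"
    "X5": ("X4", "X3"),  # "abaab"
    "X6": ("X5", "X4"),  # "abaababa"
}


def read_prefix(rules: Dict[str, Rule], root: str, k: int) -> str:
    """Return the first k characters derived by `root`.

    Performs a left-to-right depth-first walk of the parse tree and
    stops as soon as k terminals have been emitted, so subtrees to the
    right of the k-th character are never expanded.
    """
    out: List[str] = []
    stack: List[str] = [root]
    while stack and len(out) < k:
        symbol = stack.pop()
        rule = rules[symbol]
        if isinstance(rule, str):
            out.append(rule)  # terminal: emit one character
        else:
            left, right = rule
            stack.append(right)  # visited after the left child
            stack.append(left)
    return "".join(out)


print(read_prefix(RULES, "X6", 5))  # abaab
```

The explicit stack avoids recursion limits on deep grammars. Asking for more characters than the symbol derives simply returns its full expansion.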
Findings
Pattern counting time reduced to O(m)
Occurrence locating time reduced to O(m+occ)
Label reading time reduced to O(k)
Abstract
The compact directed acyclic word graph (CDAWG) of a string T of length n takes space proportional just to the number e of right extensions of the maximal repeats of T, and it is thus an appealing index for highly repetitive datasets, like collections of genomes from similar species, in which e grows significantly more slowly than n. We reduce from O(m log log n) to O(m) the time needed to count the number of occurrences of a pattern of length m, using an existing data structure that takes an amount of space proportional to the size of the CDAWG. This implies a reduction from O(m log log n + occ) to O(m + occ) in the time needed to locate all the occ occurrences of the pattern. We also reduce from O(k log log n) to O(k) the time needed to read the k characters of the label of an edge of the suffix tree of T, and we reduce…
