Linear-size CDAWG: new repetition-aware indexing and grammar compression
Takuya Takagi, Keisuke Goto, Yuta Fujishige, Shunsuke Inenaga, and Hiroki Arimura

TL;DR
This paper introduces Linear-size CDAWGs (L-CDAWGs), a new self-indexing structure combining CDAWGs and grammar compression, enabling efficient pattern matching and grammar construction for highly repetitive texts.
Contribution
The paper presents L-CDAWGs, an index for highly repetitive texts whose space is bounded by the number of extensions of maximal repeats rather than by the text length, and which supports pattern matching in $O(m + occ)$ time for constant alphabets.
Findings
L-CDAWGs use $O(\tilde e_T \log n)$ bits of space.
Pattern matching time is $O(m + occ)$ for constant alphabets.
A straight-line program (SLP) of size $O(\tilde e_T)$ for the text can be constructed in $O(n + \tilde e_T \log \sigma)$ time.
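To make the SLP finding concrete, here is a minimal sketch of what a straight-line program is and how a character of the text can be read from it without decompressing. It does not reproduce the paper's construction from the L-CDAWG; the rule names (`A`, `B`, `X1`, ...) and the helper functions `expansion_lengths` and `access` are hypothetical and chosen only for illustration.

```python
# A straight-line program (SLP): every rule is either a single terminal
# character or a pair of previously defined rule names.
# Hypothetical toy grammar, not taken from the paper: it derives "abababab".
rules = {
    "A": "a",
    "B": "b",
    "X1": ("A", "B"),    # ab
    "X2": ("X1", "X1"),  # abab
    "X3": ("X2", "X2"),  # abababab
}


def expansion_lengths(rules):
    """Return the length of the string derived by each rule."""
    lengths = {}

    def length(name):
        if name not in lengths:
            rhs = rules[name]
            if isinstance(rhs, str):
                lengths[name] = 1
            else:
                lengths[name] = length(rhs[0]) + length(rhs[1])
        return lengths[name]

    for name in rules:
        length(name)
    return lengths


def access(rules, lengths, start, i):
    """Return character i (0-based) of the string derived from `start`."""
    if not 0 <= i < lengths[start]:
        raise IndexError(i)
    name = start
    # Walk down the derivation tree, choosing the child that contains i.
    while not isinstance(rules[name], str):
        left, right = rules[name]
        if i < lengths[left]:
            name = left
        else:
            i -= lengths[left]
            name = right
    return rules[name]


lengths = expansion_lengths(rules)
text = "".join(access(rules, lengths, "X3", i) for i in range(lengths["X3"]))
print(len(rules), "rules derive a text of length", len(text))
```

Each `access` call follows one root-to-leaf path of the derivation tree, so its cost in this sketch is proportional to the grammar's height; the grammar itself has 5 rules for a text of length 8, and the gap widens as the text becomes more repetitive.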
Abstract
In this paper, we propose a novel approach to combine compact directed acyclic word graphs (CDAWGs) and grammar-based compression. This leads us to an efficient self-index, called Linear-size CDAWGs (L-CDAWGs), which can be represented with $O(\tilde e_T \log n)$ bits of space allowing for $O(\log n)$-time random and $O(1)$-time sequential accesses to edge labels, and $O(m \log \sigma + occ)$-time pattern matching. Here, $\tilde e_T$ is the number of all extensions of maximal repeats in $T$, $n$ and $m$ are respectively the lengths of the text $T$ and a given pattern, $\sigma$ is the alphabet size, and $occ$ is the number of occurrences of the pattern in $T$. The repetitiveness measure $\tilde e_T$ is known to be much smaller than the text length $n$ for highly repetitive text. For constant alphabets, our L-CDAWGs achieve $O(m + occ)$ pattern matching time with …
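As a rough illustration of what "extensions of maximal repeats" refers to, the following brute-force sketch lists the maximal repeats of a short string together with the characters that extend them to the left and to the right. It assumes the common definition (a substring that occurs at least twice and whose every one-character extension occurs strictly fewer times), uses no end-of-text sentinel, and ignores the empty string; the paper's exact conventions for the measure may differ, and the function name `maximal_repeats` is hypothetical. The running time is polynomial in the text length, so this is only suitable for toy inputs.

```python
from collections import Counter


def maximal_repeats(text):
    """Map each maximal repeat of `text` to (left extensions, right extensions).

    Assumed convention: no sentinel character, empty string not considered.
    """
    n = len(text)
    # Occurrence count of every non-empty substring (overlaps included).
    occ = Counter(text[i:j] for i in range(n) for j in range(i + 1, n + 1))
    alphabet = sorted(set(text))
    found = {}
    for w, k in list(occ.items()):
        if k < 2:
            continue  # not a repeat
        left = [a for a in alphabet if occ[a + w] > 0]
        right = [a for a in alphabet if occ[w + a] > 0]
        # Maximal: every one-character extension occurs strictly fewer times.
        if all(occ[a + w] < k for a in left) and all(occ[w + a] < k for a in right):
            found[w] = (left, right)
    return found


for w, (left, right) in sorted(maximal_repeats("abab").items()):
    print(repr(w), "left extensions:", left, "right extensions:", right)
```

Under these assumptions the only maximal repeat of `"abab"` is `"ab"`, extended by `b` on the left and `a` on the right. The point of the measure is that the total number of such extensions can stay small when the text consists of many near-copies, even as the text length grows.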
Taxonomy
Topics: Algorithms and Data Compression · Semigroups and Automata Theory · DNA and Biological Computing
