Linear-size CDAWG: new repetition-aware indexing and grammar compression

Takuya Takagi, Keisuke Goto, Yuta Fujishige, Shunsuke Inenaga, and Hiroki Arimura

arXiv:1705.09779 · cs.DS · July 28, 2017 · 1 citation

Open Access

TL;DR

This paper introduces Linear-size CDAWGs (L-CDAWGs), a new self-indexing structure combining CDAWGs and grammar compression, enabling efficient pattern matching and grammar construction for highly repetitive texts.

Contribution

The paper presents L-CDAWGs, a self-index for highly repetitive texts that fits in $O(\tilde{e}_T \log n)$ bits of space and answers pattern-matching queries in $O(m \log \sigma + occ)$ time, or $O(m + occ)$ time for constant alphabets.

Findings

01

L-CDAWGs use $O(\tilde{e}_T \log n)$ bits of space.

02

Pattern matching time is $O(m + occ)$ for constant alphabets.

03

Constructs an SLP of size $O(\tilde e_T)$ in $O(n + \tilde e_T \log \sigma)$ time.
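For readers unfamiliar with the term, a straight-line program (SLP) is a context-free grammar that derives exactly one string, with every rule either a single character or a concatenation of two other variables. The following is a minimal sketch of that idea only, not the paper's construction: the dictionary encoding, the helper names `expansion_lengths`, `access`, and `expand`, and the example grammar are all assumptions made for illustration.

```python
# Illustrative SLP: each rule is either a single character (terminal rule)
# or a pair of variable names (binary rule). This encoding is an assumption
# made for the sketch, not the representation used in the paper.

def expansion_lengths(rules):
    """Return a dict mapping each variable to the length of its expansion."""
    lengths = {}

    def length(x):
        if x not in lengths:
            rhs = rules[x]
            if isinstance(rhs, str):
                lengths[x] = 1
            else:
                lengths[x] = length(rhs[0]) + length(rhs[1])
        return lengths[x]

    for x in rules:
        length(x)
    return lengths


def access(rules, lengths, start, i):
    """Return character i (0-based) of the expansion of `start`.

    Walks down the derivation tree, so the cost is proportional to the
    height of the grammar rather than to the length of the text.
    """
    if not 0 <= i < lengths[start]:
        raise IndexError(i)
    x = start
    while not isinstance(rules[x], str):
        left, right = rules[x]
        if i < lengths[left]:
            x = left
        else:
            i -= lengths[left]
            x = right
    return rules[x]


def expand(rules, x):
    """Fully decompress variable x (for checking small examples only)."""
    rhs = rules[x]
    if isinstance(rhs, str):
        return rhs
    return expand(rules, rhs[0]) + expand(rules, rhs[1])


# Hypothetical example: five rules derive the 8-character text "abababab".
rules = {
    "A": "a",
    "B": "b",
    "X1": ("A", "B"),
    "X2": ("X1", "X1"),
    "X3": ("X2", "X2"),
}
lengths = expansion_lengths(rules)
```

Here `expand(rules, "X3")` reproduces the text, and `access(rules, lengths, "X3", 5)` reads one character without decompressing. Repetitive texts are what allow the grammar to be much smaller than the text.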

Abstract

In this paper, we propose a novel approach to combine compact directed acyclic word graphs (CDAWGs) and grammar-based compression. This leads us to an efficient self-index, called Linear-size CDAWGs (L-CDAWGs), which can be represented with $O(\tilde{e}_T \log n)$ bits of space allowing for $O(\log n)$-time random and $O(1)$-time sequential accesses to edge labels, and $O(m \log \sigma + occ)$-time pattern matching. Here, $\tilde{e}_T$ is the number of all extensions of maximal repeats in $T$, $n$ and $m$ are respectively the lengths of the text $T$ and a given pattern, $\sigma$ is the alphabet size, and $occ$ is the number of occurrences of the pattern in $T$. The repetitiveness measure $\tilde{e}_T$ is known to be much smaller than the text length $n$ for highly repetitive text. For constant alphabets, our L-CDAWGs achieve $O(m + occ)$ pattern matching time with $O(e_T^r \log n)$ …
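To make the measure concrete, the sketch below counts maximal repeats and their extensions by brute force on a short string. It rests on several assumptions not stated in the abstract: a maximal repeat is taken to be a non-empty substring that occurs at least twice and whose every one-character left or right extension occurs strictly fewer times; no end-marker is appended to the text; and "all extensions" is read as left plus right extensions. The function names are hypothetical, and the running time is polynomial, nothing like the linear-time construction the paper describes.

```python
def count_overlapping(text, w):
    """Number of (possibly overlapping) occurrences of w in text."""
    return sum(1 for i in range(len(text) - len(w) + 1) if text.startswith(w, i))


def maximal_repeats(text):
    """Brute-force set of non-empty maximal repeats (assumed definition)."""
    alphabet = set(text)
    result = set()
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            w = text[i:j]
            c = count_overlapping(text, w)
            if c < 2:
                break  # longer substrings starting at i cannot occur more often
            if all(
                count_overlapping(text, a + w) < c
                and count_overlapping(text, w + a) < c
                for a in alphabet
            ):
                result.add(w)
    return result


def extension_counts(text):
    """Return (left, right): how many one-character left and right
    extensions of maximal repeats actually occur in text."""
    alphabet = set(text)
    repeats = maximal_repeats(text)
    left = sum(1 for w in repeats for a in alphabet if a + w in text)
    right = sum(1 for w in repeats for a in alphabet if w + a in text)
    return left, right
```

On `"abcabc"` the only non-empty maximal repeat under this definition is `"abc"`, with one left extension (`"cabc"`) and one right extension (`"abca"`). Doubling the text again adds few new maximal repeats relative to the added length, which is the intuition behind the measure being small for repetitive inputs.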

Taxonomy

Topics: Algorithms and Data Compression · Semigroups and Automata Theory · DNA and Biological Computing