Document Listing on Repetitive Collections with Guaranteed Performance

Gonzalo Navarro

arXiv:1707.06374·cs.DS·November 15, 2018

TL;DR

This paper introduces a document listing index for repetitive string collections. Its size is $\tilde{O}(n + s)$, where $n$ is the length of the base string and $s$ is the number of edits applied to its copies, and it comes with worst-case query time guarantees.

Contribution

The paper presents the first document listing index with size $\tilde{O}(n + s)$ and worst-case time guarantees, along with novel grammar-based indexes that count pattern occurrences efficiently.

Findings

01

Index size is $O((n \log \sigma + s \log^{2} N) \log D)$ bits.

02

Pattern occurrence counting is achieved in $O(m^{2} + m \log^{2+\epsilon} r)$ time.

03

The index can count occurrences in $O(m \log^{2+\epsilon} N)$ time for Lempel-Ziv parsed texts.

Abstract

We consider document listing on string collections, that is, finding in which strings a given pattern appears. In particular, we focus on repetitive collections: a collection of size $N$ over alphabet $[1, \sigma]$ is composed of $D$ copies of a string of size $n$, and $s$ edits are applied on ranges of copies. We introduce the first document listing index with size $\tilde{O}(n + s)$, precisely $O((n \log \sigma + s \log^{2} N) \log D)$ bits, and with useful worst-case time guarantees: Given a pattern of length $m$, the index reports the $\mathrm{ndoc} > 0$ strings where it appears in time $O(m \log^{1+\epsilon} N \cdot \mathrm{ndoc})$, for any constant $\epsilon > 0$ (and tells in time $O(m \log N)$ if $\mathrm{ndoc} = 0$). Our technique is to augment a range data structure that is commonly used on grammar-based indexes, so that instead of retrieving all the pattern occurrences, it computes useful summaries on them. We show…
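
To make the problem statement concrete, the following is a minimal Python sketch of what document listing and occurrence counting compute. It is a naive linear scan over every document, not the index described in the paper, and it offers none of the paper's space or time guarantees. The function names `list_documents` and `count_occurrences` and the toy collection are hypothetical, chosen only for illustration.

```python
from typing import List


def list_documents(docs: List[str], pattern: str) -> List[int]:
    """Return the indices of the documents that contain `pattern`.

    Naive baseline: scans every document in full, so the cost grows with
    the total collection size N rather than with the number of answers.
    """
    return [i for i, doc in enumerate(docs) if pattern in doc]


def count_occurrences(text: str, pattern: str) -> int:
    """Count the (possibly overlapping) occurrences of `pattern` in `text`."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position so overlapping matches are also counted.
        start = text.find(pattern, start + 1)
    return count


# Hypothetical repetitive collection: one base string plus lightly edited copies.
base = "abracadabra"
docs = [
    base,                          # unedited copy
    base.replace("c", "x"),        # one substitution
    base[:5] + "z" + base[6:],     # another substitution
]

hits = list_documents(docs, "cad")       # documents containing "cad"
total = count_occurrences(base, "abra")  # occurrences inside one document
```

The contrast with the abstract is the point of the sketch: the scan above always reads the whole collection, whereas the paper's index answers in time that depends on $m$, $\log N$, and the number of reported documents $\mathrm{ndoc}$.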
