Text Indexing for Long Patterns using Locally Consistent Anchors

Lorraine A. K. Ayad; Grigorios Loukides; Solon P. Pissis

arXiv:2407.11819 · cs.DS · July 17, 2024 · 1 citation

TL;DR

This paper introduces a text index based on locally consistent anchors (lc-anchors) that balances index space, query time, construction space, and construction time when a lower bound on the pattern length is known, and that outperforms classic indexes in experiments.

Contribution

It proposes a new index structure built on lc-anchors that performs well on all four measures (index space, query time, construction space, and construction time), with both average-case and worst-case guarantees.
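To make the role of the pattern-length lower bound concrete, here is a minimal sketch of anchor-based indexing. It is not the paper's construction: it uses plain minimizers (the smallest k-mer in each window of length `ell`) as a stand-in for lc-anchors, and every name in it (`minimizer_pos`, `build_anchor_index`, `locate`, `ell`, `k`) is hypothetical. The idea it illustrates is that only anchor positions are indexed, and any pattern of length at least `ell` is guaranteed to contain an anchor that lines up with an indexed text position.

```python
from collections import defaultdict

# Illustrative sketch only: minimizers stand in for lc-anchors.
# All function and parameter names here are hypothetical.

def minimizer_pos(s, start, ell, k):
    """Start of the smallest k-mer inside s[start:start+ell] (leftmost on ties)."""
    return min(range(start, start + ell - k + 1), key=lambda i: (s[i:i + k], i))

def build_anchor_index(text, ell, k):
    """Map each anchor k-mer to the sorted text positions selected as anchors."""
    # One anchor per window of length ell; neighbouring windows often share it,
    # so the anchor set is usually much smaller than the text.
    anchors = {minimizer_pos(text, w, ell, k) for w in range(len(text) - ell + 1)}
    index = defaultdict(list)
    for p in sorted(anchors):
        index[text[p:p + k]].append(p)
    return index

def locate(text, index, pattern, ell, k):
    """Return all start positions of pattern in text; requires len(pattern) >= ell."""
    if len(pattern) < ell:
        raise ValueError("pattern is shorter than the lower bound ell")
    # The first window of the pattern selects the same anchor as the matching
    # text window would, so every occurrence is reachable from an indexed anchor.
    j = minimizer_pos(pattern, 0, ell, k)
    hits = []
    for p in index.get(pattern[j:j + k], []):
        start = p - j
        if start >= 0 and text.startswith(pattern, start):
            hits.append(start)
    return hits

text = "abracadabra"
index = build_anchor_index(text, ell=4, k=2)
print(locate(text, index, "abra", ell=4, k=2))  # [0, 7]
```

In this toy run only 4 of the 11 text positions are stored. The sketch verifies each candidate by direct comparison against the text, which is a simplification chosen for brevity.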

Findings

01. Outperforms classic indexes like suffix trees, suffix arrays, and FM-index in experiments.

02. Offers average-case guarantees for all four measures.

03. Provides a new index with worst-case guarantees based on lc-anchors.

Abstract

In many real-world database systems, a large fraction of the data is represented by strings: sequences of letters over some alphabet. This is because strings can easily encode data arising from different sources. It is often crucial to represent such string datasets in a compact form but also to simultaneously enable fast pattern matching queries. This is the classic text indexing problem. The four absolute measures anyone should pay attention to when designing or implementing a text index are: (i) index space; (ii) query time; (iii) construction space; and (iv) construction time. Unfortunately, however, most (if not all) widely-used indexes (e.g., suffix tree, suffix array, or their compressed counterparts) are not optimized for all four measures simultaneously, as it is difficult to have the best of all four worlds. Here, we take an important step in this direction by showing that…
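As a point of reference for the classic text indexing problem described above, the following is a minimal sketch of one of the baseline indexes the abstract names, the suffix array. It is not the paper's index, the construction is deliberately naive, and the function names (`build_suffix_array`, `sa_locate`) are hypothetical.

```python
# Illustrative baseline only: a plain suffix array with binary-search lookup.
# Function names are hypothetical; the build is naive and meant for small inputs.

def build_suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def sa_locate(text, sa, pattern):
    """Return the sorted start positions of every occurrence of pattern."""
    m = len(pattern)

    def first_index(pred):
        # Smallest rank whose length-m suffix prefix satisfies pred.
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            if pred(text[sa[mid]:sa[mid] + m]):
                hi = mid
            else:
                lo = mid + 1
        return lo

    # Suffixes that start with pattern form one contiguous block of the array.
    left = first_index(lambda prefix: prefix >= pattern)
    right = first_index(lambda prefix: prefix > pattern)
    return sorted(sa[left:right])

text = "abracadabra"
sa = build_suffix_array(text)
print(sa_locate(text, sa, "abra"))  # [0, 7]
```

The sketch touches all four measures: the index stores one integer per text position (index space), a query performs a logarithmic number of prefix comparisons (query time), and the naive sort materializes suffix copies while sorting (construction space and time).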

Code & Models

Repositories

lorrainea/rrBDA-index (official)

Taxonomy

Topics: Data Mining Algorithms and Applications · Rough Sets and Fuzzy Logic · Text and Document Classification Technologies