Full-text and Keyword Indexes for String Searching

Aleksander Cisłak

arXiv:1508.06610·cs.DS·August 27, 2015

Open Access

TL;DR

This paper reviews full-text and keyword indexes, introduces the FM-bloated index for faster string searches with higher space use, and presents the split index for efficient approximate matching, demonstrating significant speed improvements.

Contribution

It introduces the FM-bloated index with space-speed trade-offs and the split index for fast k-mismatch queries, along with practical implementation insights.

Findings

1. The FM-bloated index achieves faster searches at a high space cost.

2. The split index efficiently solves the k-mismatches problem, particularly for 1 error.

3. Query times of about 1 microsecond are reported for small dictionaries.
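
To make the k-mismatches problem concrete, the following is a minimal Python sketch of dictionary search under Hamming distance. It relies on the standard pigeonhole argument: if a word is cut into k + 1 pieces and at most k characters differ, at least one piece must match exactly. This is an illustration under stated assumptions, not the author's split index; the names `pieces`, `build_index`, and `search` are hypothetical, and the sketch ignores the data compaction that a practical implementation would need.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def pieces(word, k):
    """Split word into k + 1 contiguous pieces.

    Piece boundaries depend only on len(word), so two words of the same
    length are always cut at the same positions.
    """
    n = len(word)
    bounds = [n * i // (k + 1) for i in range(k + 2)]
    return [word[bounds[i]:bounds[i + 1]] for i in range(k + 1)]


def build_index(words, k):
    """Map (word length, piece number, piece text) to the words containing it."""
    index = defaultdict(set)
    for w in words:
        for i, p in enumerate(pieces(w, k)):
            index[(len(w), i, p)].add(w)
    return index


def search(index, query, k):
    """Return all indexed words within Hamming distance k of query."""
    result = set()
    for i, p in enumerate(pieces(query, k)):
        # Candidates share one exact piece with the query; verify the rest.
        for cand in index.get((len(query), i, p), ()):
            if hamming(cand, query) <= k:
                result.add(cand)
    return sorted(result)


words = ["cat", "cot", "cut", "dog", "cart", "bat"]
index = build_index(words, k=1)
print(search(index, "cat", k=1))
```

Only words that share an exact piece with the query are verified, so the number of Hamming-distance computations is typically far smaller than the dictionary size.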

Abstract

In this work, we present a literature review for full-text and keyword indexes as well as our contributions (which are mostly practice-oriented). The first contribution is the FM-bloated index, which is a modification of the well-known FM-index (a compressed, full-text index) that trades space for speed. In our approach, the count table and the occurrence lists store information about selected $q$-grams in addition to the individual characters. Two variants are described, namely one using $O(n \log^{2} n)$ bits of space with $O(m + \log m \log \log n)$ average query time, and one with linear space and $O(m \log \log n)$ average query time, where $n$ is the input text length and $m$ is the pattern length. We experimentally show that a significant speedup can be achieved by operating on $q$-grams (albeit at the cost of very high space requirements, hence the name "bloated"). In the…
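
For readers unfamiliar with the count table and occurrence lists mentioned in the abstract, here is a minimal Python sketch of the classic character-level FM-index backward search. It is a toy version under stated assumptions: the suffix array is built by naive sorting, the occurrence table is stored uncompressed, and the function names `build_fm_index` and `count_occurrences` are hypothetical. The $q$-gram extension that defines the FM-bloated index is not shown.

```python
def build_fm_index(text):
    """Build a toy FM-index (naive suffix sorting; suitable only for short texts)."""
    text += "\0"  # sentinel, assumed smaller than every other character
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)

    alphabet = sorted(set(text))
    # Count table: count[c] = number of characters in text smaller than c.
    count = {}
    total = 0
    for c in alphabet:
        count[c] = total
        total += text.count(c)

    # Occurrence table: occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return count, occ, len(bwt)


def count_occurrences(index, pattern):
    """Count occurrences of pattern, consuming one character per step."""
    count, occ, n = index
    lo, hi = 0, n  # half-open range of suffix-array rows
    for c in reversed(pattern):
        if c not in count:
            return 0
        lo = count[c] + occ[c][lo]
        hi = count[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("abracadabra")
print(count_occurrences(index, "abra"))
```

Each loop iteration consumes a single pattern character, so a pattern of length $m$ takes $m$ steps. Storing the same kind of information for selected $q$-grams, as the abstract describes, lets one step consume several characters at once, which is where the space-for-speed trade-off comes from.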

Taxonomy

Topics: Algorithms and Data Compression · Natural Language Processing Techniques · Network Packet Processing and Optimization