A bloated FM-index reducing the number of cache misses during the search

Szymon Grabowski; Aleksander Cisłak

arXiv:1512.01996·cs.DS·December 8, 2015

Open Access

TL;DR

This paper introduces a bloated FM-index variant that significantly reduces cache misses during pattern search by working on q-grams, trading off increased space for improved speed, especially for long patterns.

Contribution

It presents a novel FM-index variant that minimizes cache misses by using q-grams and sorted suffix occurrence lists, achieving faster searches at the cost of large space requirements.

Findings

01

Achieves $O(m/|CL| + \log n \log m)$ cache misses in the worst case

02

Often several times faster than existing FM-indexes for long patterns

03

Requires substantially more space, up to $O(n \log^2 n)$ bits

Abstract

The FM-index is a well-known compressed full-text index, based on the Burrows-Wheeler transform (BWT). During a pattern search, the BWT sequence is accessed at "random" locations, which is cache-unfriendly. In this paper, we are interested in speeding up the FM-index by working on $q$-grams rather than individual characters, at the cost of using more space. The first presented variant is related to an inverted index on $q$-grams, yet the occurrence lists in our solution are in the sorted suffix order rather than text order in a traditional inverted index. This variant obtains $O(m/|CL| + \log n \log m)$ cache misses in the worst case, where $n$ and $m$ are the text and pattern lengths, respectively, and $|CL|$ is the CPU cache line size, in symbols (typically 64 in modern hardware). This index is often several times faster than the fastest known FM-indexes (especially for long…
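
To make the cache-unfriendly access pattern concrete, here is a minimal sketch of the classic character-by-character FM-index backward search. It is the textbook baseline, not the q-gram variant proposed in the paper. The function names (`suffix_array`, `build_fm_index`, `count_occurrences`), the naive suffix sorting, and the uncompressed `occ` tables are assumptions chosen for readability; a real FM-index would use compressed rank structures instead.

```python
def suffix_array(text):
    # Naive construction by sorting suffixes; acceptable only for toy inputs.
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_fm_index(text):
    # Append a sentinel that sorts before every other symbol.
    text += "\0"
    sa = suffix_array(text)
    # BWT: the symbol preceding each suffix in sorted suffix order.
    bwt = [text[i - 1] for i in sa]
    alphabet = sorted(set(text))

    # C[c] = number of symbols in the text strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)

    # occ[c][i] = number of occurrences of c in bwt[:i].
    # Stored uncompressed here (an illustrative simplification).
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, b in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (b == c)

    return C, occ, len(bwt)


def count_occurrences(pattern, C, occ, n):
    # Maintain the half-open suffix array interval [lo, hi) of suffixes
    # prefixed by the part of the pattern processed so far.
    lo, hi = 0, n
    for c in reversed(pattern):
        if c not in C:
            return 0
        # Two rank lookups per pattern symbol, at positions lo and hi that
        # jump around the BWT; these are the "random" accesses.
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


C, occ, n = build_fm_index("abracadabra")
print(count_occurrences("abra", C, occ, n))
```

In this baseline every pattern symbol triggers two rank lookups at unrelated positions, so the number of potential cache misses grows with $m$. Processing $q$ symbols per step, as the abstract describes, is one way to cut the number of such steps, at the cost of extra space.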


Taxonomy

Topics: Algorithms and Data Compression · Advanced Image and Video Retrieval Techniques · Error Correcting Code Techniques