A bloated FM-index reducing the number of cache misses during the search
Szymon Grabowski, Aleksander Cisłak

TL;DR
This paper introduces a bloated FM-index variant that significantly reduces cache misses during pattern search by working on q-grams, trading off increased space for improved speed, especially for long patterns.
Contribution
It presents a novel FM-index variant that minimizes cache misses by using q-grams and sorted suffix occurrence lists, achieving faster searches at the cost of large space requirements.
Findings
Achieves $O(m/|CL| + \log n \log m)$ cache misses in worst case
Often several times faster than existing FM-indexes for long patterns
Requires substantially more space, up to $O(n \log^2 n)$ bits
Abstract
The FM-index is a well-known compressed full-text index, based on the Burrows-Wheeler transform (BWT). During a pattern search, the BWT sequence is accessed at "random" locations, which is cache-unfriendly. In this paper, we are interested in speeding up the FM-index by working on $q$-grams rather than individual characters, at the cost of using more space. The first presented variant is related to an inverted index on $q$-grams, yet the occurrence lists in our solution are in the sorted suffix order rather than text order in a traditional inverted index. This variant obtains $O(m/|CL| + \log n \log m)$ cache misses in the worst case, where $n$ and $m$ are the text and pattern lengths, respectively, and $|CL|$ is the CPU cache line size, in symbols (typically 64 in modern hardware). This index is often several times faster than the fastest known FM-indexes (especially for long…
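To make the cache-unfriendliness concrete, the following is a minimal sketch of the classical character-by-character FM-index backward search, not the q-gram variant proposed in the paper. It assumes a naive suffix-array construction and uncompressed rank tables, and all function names (`build_index`, `backward_search`) are hypothetical. Each pattern symbol triggers two rank lookups at essentially unrelated positions of the BWT, which is the access pattern the abstract calls "random".

```python
def build_index(text):
    """Build a toy FM-index: BWT, suffix array, C array, and rank tables.

    Naive O(n^2 log n) construction, for illustration only.
    """
    s = text + "\0"  # sentinel, assumed smaller than every text symbol
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    bwt = "".join(s[i - 1] for i in sa)  # s[-1] is the sentinel when i == 0

    alphabet = sorted(set(bwt))
    # C[c] = number of symbols in the text that are smaller than c
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += bwt.count(c)

    # occ[c][i] = number of occurrences of c in bwt[:i] (uncompressed ranks)
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return bwt, sa, C, occ


def backward_search(pattern, bwt, C, occ):
    """Return (lo, hi, rank_lookups); matches are the suffix-array rows [lo, hi).

    rank_lookups counts accesses to the rank tables, each of which may
    touch a different cache line in a real implementation.
    """
    lo, hi = 0, len(bwt)
    rank_lookups = 0
    for c in reversed(pattern):  # one step per pattern symbol
        if c not in C:
            return 0, 0, rank_lookups
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        rank_lookups += 2
        if lo >= hi:
            return 0, 0, rank_lookups
    return lo, hi, rank_lookups


if __name__ == "__main__":
    bwt, sa, C, occ = build_index("abracadabra")
    lo, hi, lookups = backward_search("abra", bwt, C, occ)
    print("occurrences:", hi - lo)
    print("positions:", sorted(sa[lo:hi]))
    print("rank lookups:", lookups)
```

In this sketch the number of rank lookups grows as $2m$, one pair per pattern symbol. Processing $q$ symbols per step, as the paper proposes, would reduce the number of such steps roughly by a factor of $q$, at the price of a larger index.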
Taxonomy
Topics: Algorithms and Data Compression · Advanced Image and Video Retrieval Techniques · Error Correcting Code Techniques
