CSA++: Fast Pattern Search for Large Alphabets

Simon Gog; Alistair Moffat; Matthias Petri

arXiv:1605.05404·cs.DS·May 19, 2016


TL;DR

This paper introduces CSA++, a novel pattern search method for large alphabets that combines inverted indexing techniques with compressed suffix arrays, achieving faster search speeds and reduced space compared to previous methods.

Contribution

The paper adapts Elias-Fano coding, a technique from inverted indexing and document retrieval, to Sadakane's compressed suffix array, giving faster pattern search over large alphabets together with reduced space.

Findings

01 Significantly faster pattern processing than previous implementations

02 Reduced space requirements close to highly-compressed FM-Index variants

03 Effective for large-scale data and natural language processing applications

Abstract

Indexed pattern search in text has been studied for many decades. For small alphabets, the FM-Index provides unmatched performance, in terms of both space required and search speed. For large alphabets -- for example, when the tokens are words -- the situation is more complex, and FM-Index representations are compact, but potentially slow. In this paper we apply recent innovations from the field of inverted indexing and document retrieval to compressed pattern search, including for alphabets into the millions. Commencing with the practical compressed suffix array structure developed by Sadakane, we show that the Elias-Fano code-based approach to document indexing can be adapted to provide new tradeoff options in indexed pattern search, and offers significantly faster pattern processing compared to previous implementations, as well as reduced space requirements. We report a detailed…
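
For readers unfamiliar with the Elias-Fano code mentioned in the abstract, the following is a minimal Python sketch of the generic encoding of a non-decreasing integer sequence. It is not the paper's implementation and says nothing about how CSA++ lays out its data; the function names `elias_fano_encode` and `elias_fano_decode` and the list-of-bits representation are assumptions chosen for readability. Each value is split into low bits stored verbatim and a high part stored in unary, so the high bitvector has one 1 per element and one 0 per high-part increment.

```python
import math


def elias_fano_encode(values, universe):
    """Encode a non-decreasing list of ints drawn from [0, universe).

    Returns (low_bits, lows, high), where `lows` holds the low_bits
    least-significant bits of each value and `high` is a list of 0/1
    bits coding the remaining high parts in unary.
    Hypothetical helper for illustration only.
    """
    n = len(values)
    if n == 0:
        return 0, [], []
    # Standard choice: floor(log2(universe / n)) low bits, never negative.
    low_bits = max(0, math.floor(math.log2(universe / n)))
    low_mask = (1 << low_bits) - 1
    lows = [v & low_mask for v in values]

    high = []
    prev_high = 0
    for v in values:
        h = v >> low_bits
        # One 0 for every step the high part advances, then a 1 for the element.
        high.extend([0] * (h - prev_high))
        high.append(1)
        prev_high = h
    return low_bits, lows, high


def elias_fano_decode(low_bits, lows, high):
    """Invert elias_fano_encode by scanning the high bitvector once."""
    values = []
    h = 0  # current high part = number of 0 bits seen so far
    i = 0  # index of the next element
    for bit in high:
        if bit == 0:
            h += 1
        else:
            values.append((h << low_bits) | lows[i])
            i += 1
    return values


seq = [3, 4, 7, 13, 14, 15, 21, 43]
low_bits, lows, high = elias_fano_encode(seq, universe=44)
decoded = elias_fano_decode(low_bits, lows, high)
```

A practical index would pack the bits and add select support on the high bitvector so that individual elements can be accessed without a full scan; the sequential decoder here only demonstrates that the representation is lossless.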
