CSA++: Fast Pattern Search for Large Alphabets
Simon Gog, Alistair Moffat, Matthias Petri

TL;DR
This paper introduces CSA++, a pattern search index for large alphabets that combines techniques from inverted indexing with compressed suffix arrays. It searches faster than previous implementations while needing space close to that of highly compressed FM-Index variants.
Contribution
The paper adapts the Elias-Fano coding used in document indexing to Sadakane's compressed suffix array, giving new space/time tradeoff options for pattern search over large alphabets, including alphabets with millions of symbols.
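For readers unfamiliar with Elias-Fano coding, the following is a minimal sketch of the plain code for a non-decreasing integer sequence. It is not the CSA++ structure itself: the names `ef_encode` and `ef_decode` are hypothetical, and the high-bit sequence is held in an ordinary Python list, whereas a practical index would presumably use a bitvector with select support.

```python
from math import floor, log2


def ef_encode(values, universe):
    """Encode a non-decreasing list of ints drawn from [0, universe).

    Illustrative only: returns (low_bits, lows, high), where `high` is a
    list of 0/1 flags standing in for a real bitvector.
    """
    n = len(values)
    # Each value keeps floor(log2(universe / n)) low bits verbatim.
    low_bits = max(0, floor(log2(universe / n))) if n else 0
    mask = (1 << low_bits) - 1
    lows = [v & mask for v in values]
    # The remaining high parts are stored in unary: the i-th value
    # sets the flag at position (high part of value) + i.
    high = [0] * (n + (universe >> low_bits) + 1)
    for i, v in enumerate(values):
        high[(v >> low_bits) + i] = 1
    return low_bits, lows, high


def ef_decode(low_bits, lows, high):
    """Recover the original sequence from the two parts."""
    out = []
    i = 0  # index of the next value to emit
    for pos, bit in enumerate(high):
        if bit:
            # Subtracting i undoes the per-element offset added in encoding.
            out.append(((pos - i) << low_bits) | lows[i])
            i += 1
    return out


values = [3, 4, 7, 13, 14, 15, 21, 43]
encoded = ef_encode(values, universe=44)
decoded = ef_decode(*encoded)
```

With eight values from a universe of 44, each value keeps two low bits and the high parts fit in a 20-entry flag sequence, so the cost per element stays close to 2 + log2(universe / n) bits.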
Findings
Significantly faster pattern processing than previous implementations
Reduced space requirements close to highly-compressed FM-Index variants
Effective for large-scale data and natural language processing applications
Abstract
Indexed pattern search in text has been studied for many decades. For small alphabets, the FM-Index provides unmatched performance, in terms of both space required and search speed. For large alphabets -- for example, when the tokens are words -- the situation is more complex, and FM-Index representations are compact, but potentially slow. In this paper we apply recent innovations from the field of inverted indexing and document retrieval to compressed pattern search, including for alphabets into the millions. Commencing with the practical compressed suffix array structure developed by Sadakane, we show that the Elias-Fano code-based approach to document indexing can be adapted to provide new tradeoff options in indexed pattern search, and offers significantly faster pattern processing compared to previous implementations, as well as reduced space requirements. We report a detailed…
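To make the query concrete, here is a minimal sketch of indexed pattern search when the alphabet consists of words. It uses a plain, uncompressed suffix array with binary search, not the compressed structure the abstract describes, and the function names `build_suffix_array` and `find_range` are hypothetical.

```python
def build_suffix_array(tokens):
    """Return suffix start positions sorted by the token suffix they begin.

    Quadratic-ish construction, fine for illustration only.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_range(tokens, sa, pattern):
    """Return (start, end) so that sa[start:end] holds every match position."""
    m = len(pattern)

    def prefix(k):
        # First m tokens of the k-th smallest suffix.
        return tokens[sa[k]:sa[k] + m]

    # Lower bound: first suffix whose m-token prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-token prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


tokens = "to be or not to be that is the question".split()
sa = build_suffix_array(tokens)
start, end = find_range(tokens, sa, ["to", "be"])
matches = sorted(sa[start:end])
```

Here every distinct word is one symbol, so the alphabet grows with the vocabulary. That is the setting in which, according to the abstract, FM-Index representations remain compact but can become slow.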
