An index for regular expression queries: Design and implementation
Dominic Tsang, Sanjay Chawla

TL;DR
This paper introduces a new indexing method for regular expression queries in databases, formulating it as an optimization problem and providing algorithms with proven guarantees, significantly improving query performance.
Contribution
It presents a novel, robust approach to index regular expression queries by generating multigrams through optimization, supported by algorithms with theoretical guarantees.
Findings
Accurate and efficient indexing demonstrated on synthetic datasets
Effective indexing for complex PROSITE protein patterns
First practical indexing mechanism for regular expression queries
Abstract
The LIKE regular expression predicate has been part of the SQL standard since at least 1989. However, despite its popularity and wide usage, database vendors provide only limited indexing support for regular expression queries, which almost always require a full table scan. In this paper we propose a rigorous and robust approach for providing indexing support for regular expression queries. Our approach consists of formulating the indexing problem as a combinatorial optimization problem. We begin with a database, abstracted as a collection of strings. From this data set we generate a query workload. The input to the optimization problem is the database and the workload. The output is a set of multigrams (substrings) which can be used as keys to records which satisfy the query workload. The multigrams can then be integrated with the data structure (like B+ trees) to provide indexing…
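To make the idea of multigrams as index keys concrete, here is a minimal sketch of how a fixed set of multigrams could prefilter records before a regular expression is evaluated. It is not the paper's optimization algorithm: the multigram set is simply given as input, a plain dictionary stands in for a B+ tree, and the names `build_multigram_index`, `query`, and `required_grams` are hypothetical. The caller is assumed to supply the literal substrings that every match of the pattern must contain.

```python
import re


def build_multigram_index(records, multigrams):
    """Map each chosen multigram to the set of record ids that contain it."""
    index = {gram: set() for gram in multigrams}
    for rid, text in enumerate(records):
        for gram in multigrams:
            if gram in text:
                index[gram].add(rid)
    return index


def query(records, index, pattern, required_grams):
    """Evaluate `pattern` only on records that contain every required multigram.

    `required_grams` are literals that any match must contain (an assumption
    supplied by the caller). Grams that are not indexed are skipped, which
    only weakens the filter; the final regex check keeps the result exact.
    """
    candidates = set(range(len(records)))
    for gram in required_grams:
        if gram in index:
            candidates &= index[gram]
    regex = re.compile(pattern)
    return sorted(rid for rid in candidates if regex.search(records[rid]))


# Toy data: short protein-like strings (illustrative only).
records = ["ACDEFGHIK", "MKVLAAGIC", "ACDKKGHIK", "GHIKACD"]
index = build_multigram_index(records, ["ACD", "GHI", "VLA"])
matches = query(records, index, r"ACD.*GHI", ["ACD", "GHI"])
```

In this toy run the index narrows the four records to the three that contain both "ACD" and "GHI", and the regular expression is then evaluated only on those candidates instead of scanning every record.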
Taxonomy
Topics: Algorithms and Data Compression · Advanced Database Systems and Queries · Network Packet Processing and Optimization
