Towards Efficient Matching of Regexes with Backreferences using Register Set Automata (Technical Report)
Vojtěch Havlena, Lukáš Holík, Ondřej Lengál, Jan Vašák, Sabína Gulčíková

TL;DR
This paper introduces register set automata (RSAs) to enable high-speed matching of regexes with backreferences, improving robustness and efficiency over traditional backtracking methods.
Contribution
It proposes RSAs as an extension of register automata, gives algorithms for transforming regexes into RSAs, and demonstrates their practical effectiveness.
Findings
A prototype implementation significantly improves the robustness of regex matching compared with backtracking matchers.
Matching complexity is linear or quadratic, depending on whether the alphabet is finite.
Theoretical properties of RSAs are established, including decidability of emptiness.
Abstract
Matching regexes (regular expressions) is a common problem in many areas of computer science, with requirements on high speed and robust performance. Regexes with backreferences allow one to express certain patterns (even beyond regular ones) concisely. However, because the matching is usually done by backtracking, the matching speed can degrade to a degree that constitutes a service failure or a security threat. To facilitate high-speed matching of such regexes, we propose register set automata (RSAs), an extension of register automata where registers can contain sets of symbols (from a potentially infinite alphabet) and the following operations are supported: adding input values to registers, merging or clearing registers, and testing whether a register contains a value. We show that a large class of register automata can be transformed into deterministic RSAs, which can serve as a basis for…
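The four register operations named in the abstract (add, merge, clear, membership test) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class `RegisterSets`, its method names, and the example function `has_repeated_symbol` are hypothetical and chosen only for illustration. The example shows why set-valued registers help with determinism: instead of guessing which symbol will reappear, a single register collects every symbol seen so far and each new symbol is tested against it.

```python
from typing import Dict, Hashable, Iterable, Set


class RegisterSets:
    """Hypothetical store of set-valued registers (illustrative names only)."""

    def __init__(self, names: Iterable[str]) -> None:
        # Every register starts out as the empty set.
        self.regs: Dict[str, Set[Hashable]] = {name: set() for name in names}

    def add(self, reg: str, value: Hashable) -> None:
        """Add the current input value to a register."""
        self.regs[reg].add(value)

    def merge(self, target: str, *sources: str) -> None:
        """Overwrite `target` with the union of the `sources` registers."""
        merged: Set[Hashable] = set()
        for src in sources:
            merged |= self.regs[src]
        self.regs[target] = merged

    def clear(self, reg: str) -> None:
        """Reset a register to the empty set."""
        self.regs[reg] = set()

    def contains(self, reg: str, value: Hashable) -> bool:
        """Test whether a register contains a value."""
        return value in self.regs[reg]


def has_repeated_symbol(word: Iterable[Hashable]) -> bool:
    """Deterministically decide whether some symbol occurs twice in `word`.

    A register automaton with single-value registers would have to guess
    which symbol repeats; a set register removes the need for the guess.
    """
    regs = RegisterSets(["seen"])
    for symbol in word:
        if regs.contains("seen", symbol):
            return True
        regs.add("seen", symbol)
    return False
```

With hash-based sets, each step of `has_repeated_symbol` performs one membership test and at most one insertion, so the word is processed in a single left-to-right pass without backtracking. The symbols may come from any hashable domain, which loosely mirrors the potentially infinite alphabet mentioned in the abstract.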
