
TL;DR
This paper presents a fast algorithm for regular expression parsing and submatch extraction based on tagged deterministic finite automata (TDFA). It works with different disambiguation policies and includes optimizations aimed at practical performance.
Contribution
It gives detailed pseudocode for the algorithm, covers important practical optimizations, and explains every transformation from a regular expression to an optimized automaton on a step-by-step example.
Findings
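Benchmarks reported in the paper show that the algorithm is very fast in practice.
It covers both ahead-of-time and just-in-time determinization, with a variant of the algorithm suited to each setting.
The work is based on two independent implementations: the open-source lexer generator RE2C and an experimental Java library.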
Abstract
We present an algorithm for regular expression parsing and submatch extraction based on tagged deterministic finite automata. The algorithm works with different disambiguation policies. We give detailed pseudocode for the algorithm, covering important practical optimizations. All transformations from a regular expression to an optimized automaton are explained on a step-by-step example. We consider both ahead-of-time and just-in-time determinization and describe variants of the algorithm suited to each setting. We provide benchmarks showing that the algorithm is very fast in practice. Our research is based on two independent implementations: an open-source lexer generator RE2C and an experimental Java library.
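To make the idea of submatch extraction with tags concrete, here is a minimal sketch in Python. It is not the paper's algorithm and is not produced by determinization: the automaton is built by hand for the regular expression `(a+)(b+)`, and the names `TRANSITIONS`, `ACCEPTING`, and `match` are hypothetical. It only illustrates how a deterministic automaton can record a submatch boundary by storing the input position in a register on a tagged transition. This example has no ambiguity, so one register suffices; handling ambiguous tags and disambiguation policies is the part the paper addresses.

```python
# Hand-built tagged DFA for the regex (a+)(b+).
# Assumption: this table is written by hand for illustration only;
# it is not the output of the paper's determinization algorithm.
#
# Each transition maps (state, symbol) -> (next_state, set_tag).
# set_tag=True means: store the current input position in register t
# before consuming the symbol.
TRANSITIONS = {
    (0, "a"): (1, False),
    (1, "a"): (1, False),
    (1, "b"): (2, True),   # first 'b': the group boundary is here
    (2, "b"): (2, False),
}
ACCEPTING = {2}


def match(text):
    """Return (group1, group2) if text matches (a+)(b+), else None."""
    state = 0
    t = None  # register holding the position of the tag
    for pos, ch in enumerate(text):
        step = TRANSITIONS.get((state, ch))
        if step is None:
            return None  # no transition: the input is rejected
        state, set_tag = step
        if set_tag:
            t = pos
    if state not in ACCEPTING:
        return None
    # The tag position splits the input into the two submatches.
    return (text[:t], text[t:])
```

Calling `match("aaabb")` walks the input once, left to right, and recovers both groups from the single stored position, with no backtracking.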
