A closer look at TDFA

Angelo Borsotti; Ulya Trafimovich

arXiv:2206.01398 · cs.FL · March 31, 2026

TL;DR

This paper presents a fast algorithm for regular expression parsing and submatch extraction based on tagged deterministic finite automata (TDFA). The algorithm works with different disambiguation policies and includes practical optimizations.

Contribution

The paper gives detailed pseudocode for the algorithm, covers important practical optimizations, and explains every transformation from a regular expression to an optimized automaton on a step-by-step example.

Findings

01

Benchmarks show that the algorithm is very fast in practice.

02

Both ahead-of-time and just-in-time determinization are considered, with a variant of the algorithm suited to each setting.

03

The research is based on two independent implementations: the open-source lexer generator RE2C and an experimental Java library.

Abstract

We present an algorithm for regular expression parsing and submatch extraction based on tagged deterministic finite automata. The algorithm works with different disambiguation policies. We give detailed pseudocode for the algorithm, covering important practical optimizations. All transformations from a regular expression to an optimized automaton are explained on a step-by-step example. We consider both ahead-of-time and just-in-time determinization and describe variants of the algorithm suited to each setting. We provide benchmarks showing that the algorithm is very fast in practice. Our research is based on two independent implementations: an open-source lexer generator RE2C and an experimental Java library.
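
To make the idea of a tagged automaton concrete, here is a minimal sketch in Python. It is not the paper's algorithm: the automaton is built by hand for the unambiguous regular expression (a*)(b*) rather than produced by determinization, and the names TRANSITIONS, FINAL and match are hypothetical. A single tag records the input position where the first group ends, and some transitions carry an operation that stores the current position in that tag.

```python
# Hypothetical hand-built tagged DFA for the regex (a*)(b*).
# One tag marks the boundary between the two groups.
# Each transition maps (state, symbol) to (next_state, set_tag).
# When set_tag is True, the position of the symbol being consumed
# (i.e. the boundary just before it) is stored in the tag.
TRANSITIONS = {
    (0, "a"): (0, False),  # still inside a*
    (0, "b"): (1, True),   # first b: group 1 ends here
    (1, "b"): (1, False),  # still inside b*
}

# Both states accept. The value says whether the tag must still be
# set at end of input (the case where the input contains no b).
FINAL = {0: True, 1: False}


def match(text):
    """Return (group1, group2) if text matches (a*)(b*), else None."""
    state, tag = 0, None
    for pos, ch in enumerate(text):
        step = TRANSITIONS.get((state, ch))
        if step is None:
            return None  # no transition: the input does not match
        state, set_tag = step
        if set_tag:
            tag = pos
    if FINAL[state]:
        tag = len(text)
    return text[:tag], text[tag:]


print(match("aab"))
print(match("aba"))
```

The input is scanned once, left to right, and the submatch boundary is available as soon as the scan ends, without backtracking. Because (a*)(b*) is unambiguous, this sketch never has to choose between competing parses; the disambiguation policies mentioned in the abstract matter for expressions where more than one parse exists.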
