Linear-time Minimization of Wheeler DFAs

Jarno Alanko; Nicola Cotumaccio; Nicola Prezza

arXiv:2111.02480·cs.DS·November 5, 2021

Open Access

TL;DR

This paper introduces a linear-time algorithm for minimizing Wheeler DFAs, improving on the $O(n \log n)$ bound of the previous method and enabling faster construction of more compact data structures for pattern matching on large datasets.

Contribution

The authors develop the first linear-time minimization algorithm for Wheeler DFAs, improving on the prior $O(n \log n)$ complexity inherited from general DFA minimization.

Findings

01 Reduces node count by up to 51% on DNA datasets

02 The implementation processes over 1 million nodes per second

03 Enables more efficient compressed data structures for pattern matching

Abstract

Wheeler DFAs (WDFAs) are a sub-class of finite-state automata which is playing an important role in the emerging field of compressed data structures: as opposed to general automata, WDFAs can be stored in just $\log \sigma + O(1)$ bits per edge, $\sigma$ being the alphabet's size, and support optimal-time pattern matching queries on the substring closure of the language they recognize. An important step to achieve further compression is minimization. When the input $A$ is a general deterministic finite-state automaton (DFA), the state-of-the-art is represented by the classic Hopcroft's algorithm, which runs in $O(|A| \log |A|)$ time. This algorithm stands at the core of the only existing minimization algorithm for Wheeler DFAs, which inherits its complexity. In this work, we show that the minimum WDFA equivalent to a given input WDFA can be computed in linear…
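
For readers unfamiliar with DFA minimization, the following is a minimal sketch of textbook Moore-style partition refinement on a general DFA. It is neither Hopcroft's algorithm nor the linear-time method of this paper: it takes quadratic time in the worst case and ignores the Wheeler order entirely. The function name `minimize_dfa`, the dictionary encoding of transitions, and the toy automaton are all assumptions made for illustration. The sketch also assumes every state is reachable and treats a missing transition as a move to an implicit rejecting sink.

```python
def minimize_dfa(states, alphabet, delta, accepting):
    """Moore-style partition refinement (illustrative baseline only).

    states:    list of hashable state names (None must not be a state)
    alphabet:  list of symbols
    delta:     dict mapping (state, symbol) -> state; may be partial
    accepting: set of accepting states
    Returns a dict mapping each state to the id of its equivalence class.
    """
    # Start from the coarsest partition: accepting vs. non-accepting.
    block = {q: int(q in accepting) for q in states}
    num_blocks = len(set(block.values()))

    while True:
        # Two states stay together only if they are in the same block
        # and, for every symbol, their successors are in the same block.
        ids = {}
        new_block = {}
        for q in states:
            signature = (block[q],) + tuple(
                block.get(delta.get((q, a))) for a in alphabet
            )
            new_block[q] = ids.setdefault(signature, len(ids))

        # Each round can only split blocks, so an unchanged block count
        # means the partition is stable.
        if len(ids) == num_blocks:
            return new_block
        block, num_blocks = new_block, len(ids)


# Toy DFA (hypothetical) accepting {"aa", "ba"}.
states = [0, 1, 2, 3, 4]
alphabet = ["a", "b"]
delta = {(0, "a"): 1, (0, "b"): 2, (1, "a"): 3, (2, "a"): 4}
accepting = {3, 4}

result = minimize_dfa(states, alphabet, delta, accepting)
print(result)
```

On this toy input, states 1 and 2 end up in one class and states 3 and 4 in another, so the five-state automaton collapses to three classes. Each refinement round scans all transitions and up to one round per state may be needed, which is the cost that Hopcroft's algorithm lowers to $O(|A| \log |A|)$ and that the paper lowers further to linear time for Wheeler DFAs.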


Taxonomy

Topics: DNA and Biological Computing · Semigroups and Automata Theory · Network Packet Processing and Optimization