Generating Tokenizers with Flat Automata
Hans de Nivelle (School of Engineering, Digital Sciences, Nazarbayev University, Nursultan-City, Kazakhstan), Dina Muktubayeva (School of Engineering, Digital Sciences, Nazarbayev University, Nursultan-City, Kazakhstan)

TL;DR
This paper presents flat automata, a simple and compact representation of finite automata for automatic tokenizer generation, together with algorithms and a C++ implementation that make tokenizers easier to construct and use.
Contribution
Introduces flat automata for tokenizer generation. The representation simplifies the standard automaton operations and makes code generation easy; the algorithms are proved correct and implemented in C++.
Findings
Flat automata are more compact than standard automata with character intervals.
Algorithms for construction, determinization, and minimization are proved correct and are no more complicated than their single-character counterparts.
C++ implementation is publicly available and used in applications and teaching.
Abstract
We introduce flat automata for automatic generation of tokenizers. Flat automata are a simple representation of standard finite automata. Using the flat representation, automata can be easily constructed, combined and printed. Due to the use of border functions, flat automata are more compact than standard automata in the case where intervals of characters are attached to transitions, and the standard algorithms on automata are simpler. We give the standard algorithms for tokenizer construction with automata, namely construction using regular operations, determinization, and minimization. We prove their correctness. The algorithms work with intervals of characters, but are not more complicated than their counterparts on single characters. It is easy to generate C++ code from the final deterministic automaton. All procedures have been implemented in C++ and are publicly available.…
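To make the interval idea concrete, here is a minimal sketch of how a border-style transition table might look. It is not the paper's data structure or API: the names `BorderFunction`, `State`, `step`, `make_demo_dfa`, and `longest_match` are hypothetical, and the sketch assumes that each border marks the first character of an interval that runs up to the next border. It also assumes ASCII input.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <string>
#include <vector>

// Hypothetical layout, for illustration only.
// Each key is a border: the first character of an interval that extends
// up to (but not including) the next key. The value is the target state
// for every character in that interval, or kNoTransition.
using BorderFunction = std::map<char, int>;
constexpr int kNoTransition = -1;

struct State {
    BorderFunction borders;
    bool accepting;
};

// Look up the transition for c: find the last border that is <= c.
int step(const BorderFunction& borders, char c) {
    assert(!borders.empty());
    auto it = borders.upper_bound(c);          // first border > c
    if (it == borders.begin()) return kNoTransition;  // c lies below all borders
    return std::prev(it)->second;
}

// A small deterministic automaton (assumed example, ASCII only):
// state 0 = start, state 1 = unsigned integer, state 2 = identifier.
std::vector<State> make_demo_dfa() {
    // Letters and digits; ':' is '9'+1, '[' is 'Z'+1, '{' is 'z'+1.
    auto alnum_to = [](int digit_target, int letter_target) {
        return BorderFunction{
            {'0', digit_target},  {':', kNoTransition},
            {'A', letter_target}, {'[', kNoTransition},
            {'a', letter_target}, {'{', kNoTransition}};
    };
    std::vector<State> dfa;
    dfa.push_back({alnum_to(1, 2), false});                    // start
    dfa.push_back({{{'0', 1}, {':', kNoTransition}}, true});   // integer
    dfa.push_back({alnum_to(2, 2), true});                     // identifier
    return dfa;
}

// Length of the longest prefix of input that the automaton accepts
// (0 if no prefix is accepted).
std::size_t longest_match(const std::vector<State>& dfa,
                          const std::string& input) {
    int state = 0;
    std::size_t best = 0;
    for (std::size_t i = 0; i < input.size(); ++i) {
        state = step(dfa[state].borders, input[i]);
        if (state == kNoTransition) break;
        if (dfa[state].accepting) best = i + 1;
    }
    return best;
}
```

In this sketch the integer state needs two borders to cover all ten digit transitions, which is the kind of saving over per-character transitions that the abstract attributes to border functions. A tokenizer would call something like `longest_match` repeatedly, consuming the longest accepted prefix each time.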
