Generating Tokenizers with Flat Automata

Hans de Nivelle (School of Engineering and Digital Sciences, Nazarbayev University, Nursultan-City, Kazakhstan); Dina Muktubayeva (School of Engineering and Digital Sciences, Nazarbayev University, Nursultan-City, Kazakhstan)

arXiv:2209.10313·cs.FL·September 22, 2022·GandALF

TL;DR

This paper presents flat automata, a simple and compact representation of finite automata for automatic tokenizer generation, together with construction algorithms and a C++ implementation.

Contribution

Introduction of flat automata for tokenizer generation, simplifying automaton operations and enabling easy code generation, with proven correctness and practical implementation.

Findings

01

Flat automata are more compact than standard automata with character intervals.

02

The algorithms for construction, determinization, and minimization are proved correct and are simpler in the flat representation.

03

C++ implementation is publicly available and used in applications and teaching.

Abstract

We introduce flat automata for automatic generation of tokenizers. Flat automata are a simple representation of standard finite automata. Using the flat representation, automata can be easily constructed, combined and printed. Due to the use of border functions, flat automata are more compact than standard automata in the case where intervals of characters are attached to transitions, and the standard algorithms on automata are simpler. We give the standard algorithms for tokenizer construction with automata, namely construction using regular operations, determinization, and minimization. We prove their correctness. The algorithms work with intervals of characters, but are not more complicated than their counterparts on single characters. It is easy to generate C++ code from the final deterministic automaton. All procedures have been implemented in C++ and are publicly available.…
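To make the idea of border functions concrete, here is a minimal C++ sketch. It is not the authors' code: the names `BorderFunction`, `State`, `longest_match`, and `identifier_dfa` are hypothetical, and the layout is an assumption based only on the abstract. The sketch stores a transition function as the sorted list of points where its value changes, so a whole interval of characters costs at most two entries, and then runs a small deterministic automaton for identifiers with longest-match semantics.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <limits>
#include <map>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical sketch, not the paper's API.
// A border function keeps only the characters at which its value changes;
// the value at c is the value stored at the largest border <= c.
template <typename V>
class BorderFunction {
    std::map<char32_t, V> borders_;  // invariant: border 0 always exists
public:
    explicit BorderFunction(V dflt) { borders_.emplace(char32_t{0}, dflt); }

    const V& operator()(char32_t c) const {
        auto it = borders_.upper_bound(c);  // first border strictly above c
        return std::prev(it)->second;       // safe because border 0 exists
    }

    // Set the value to v on the closed interval [lo, hi].
    void assign(char32_t lo, char32_t hi, const V& v) {
        assert(lo <= hi);
        const bool has_next = hi < std::numeric_limits<char32_t>::max();
        std::optional<V> after;
        if (has_next) after = (*this)(static_cast<char32_t>(hi + 1));
        borders_.erase(borders_.lower_bound(lo), borders_.upper_bound(hi));
        borders_.insert_or_assign(lo, v);
        // Restore the old value just past the interval.
        if (has_next)
            borders_.insert_or_assign(static_cast<char32_t>(hi + 1), *after);
    }

    std::size_t size() const { return borders_.size(); }
};

// One deterministic state: next(c) is the target state, or -1 for "no move".
struct State {
    BorderFunction<int> next{-1};
    bool accepting = false;
};

// Length of the longest prefix of input accepted from state 0 (0 if none).
std::size_t longest_match(const std::vector<State>& dfa,
                          std::string_view input) {
    std::size_t best = 0;
    int s = 0;
    for (std::size_t i = 0; i < input.size(); ++i) {
        s = dfa[static_cast<std::size_t>(s)].next(
            static_cast<unsigned char>(input[i]));
        if (s < 0) break;
        if (dfa[static_cast<std::size_t>(s)].accepting) best = i + 1;
    }
    return best;
}

// Automaton for [A-Za-z_][A-Za-z0-9_]*, built from character intervals.
std::vector<State> identifier_dfa() {
    std::vector<State> dfa(2);
    for (State& st : dfa) {
        st.next.assign('A', 'Z', 1);
        st.next.assign('_', '_', 1);
        st.next.assign('a', 'z', 1);
    }
    dfa[1].next.assign('0', '9', 1);
    dfa[1].accepting = true;
    return dfa;
}
```

In this sketch, the start state of `identifier_dfa()` covers 53 letters and the underscore-free gaps between them with only seven borders, and `longest_match(identifier_dfa(), "foo_1+2")` consumes the five characters of `foo_1`. A table indexed by single characters would need one entry per character instead, which is the compactness argument the abstract makes for intervals.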
