Tokenization as Finite-State Transduction

Marco Cognetta; Naoaki Okazaki

arXiv:2410.15696 · cs.CL · October 22, 2024

Open Access

TL;DR

This paper presents a finite-state transduction framework for tokenization, unifying popular schemes like BPE and MaxMatch, and enabling constrained language model generation aligned with specific tokenization patterns.

Contribution

It introduces a first-principles finite-state framework for tokenization that encompasses existing methods and supports pattern-constrained generation respecting tokenization.

Findings

1. BPE and MaxMatch fit within the finite-state transduction framework.
2. The framework enables pattern-constrained generation aligned with tokenization.
3. This allows for more precise control over language model outputs.
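
To make the first finding concrete, here is a minimal sketch of the two schemes as plain string procedures, not as the transducers the paper constructs. The names `TOY_VOCAB`, `TOY_MERGES`, `maxmatch_tokenize`, and `bpe_tokenize` are hypothetical, the vocabulary and merge list are invented for illustration, and WordPiece's `##` continuation prefixes are omitted for brevity.

```python
# Toy data, assumed for illustration only (not taken from the paper).
TOY_VOCAB = {"a", "b", "c", "ab", "bc"}
TOY_MERGES = [("b", "c"), ("a", "b")]  # earlier entries have higher priority


def maxmatch_tokenize(text, vocab):
    """Greedy left-to-right longest-prefix matching (MaxMatch)."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate starting at position i first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry matches at position {i}")
    return tokens


def bpe_tokenize(text, merges):
    """Apply BPE merges in priority order, starting from single characters."""
    rank = {pair: r for r, pair in enumerate(merges)}
    tokens = list(text)
    while True:
        # Collect every adjacent pair that has a merge rule.
        candidates = [(rank[p], p) for p in zip(tokens, tokens[1:]) if p in rank]
        if not candidates:
            return tokens
        _, best = min(candidates)  # highest-priority applicable merge
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged


print(maxmatch_tokenize("abc", TOY_VOCAB))  # ['ab', 'c']
print(bpe_tokenize("abc", TOY_MERGES))      # ['a', 'bc']
```

On the same input the two schemes disagree: MaxMatch commits to `ab` at the left edge, while BPE first applies the higher-priority merge `("b", "c")` in the middle of the string. This is the sense in which BPE does not tokenize strings from left to right.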

Abstract

Tokenization is the first step in modern neural language model pipelines where an input text is converted to a sequence of subword tokens. We introduce from first principles a finite-state transduction framework which can efficiently encode all possible tokenizations of a regular language. We then constructively show that Byte-Pair Encoding (BPE) and MaxMatch (WordPiece), two popular tokenization schemes, fit within this framework. For BPE, this is particularly surprising given its resemblance to context-free grammar and the fact that it does not tokenize strings from left to right. An application of this is to guided generation, where the outputs of a language model are constrained to match some pattern. Here, patterns are encoded at the character level, which creates a mismatch between the constraints and the model's subword vocabulary. While past work has focused only on…
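
Two ideas from the abstract can be sketched in a few lines of Python. The first function enumerates every tokenization of a single string, which corresponds to the set of paths through a small acyclic automaton whose states are string positions. The second pair of functions illustrates the character/subword mismatch: a character-level pattern automaton is advanced one whole token at a time to find which tokens may come next. All names (`all_tokenizations`, `step_token`, `allowed_tokens`, `PATTERN_DFA`) and the toy data are assumptions made for illustration; this is a simple illustration of the mismatch, not the construction from the paper.

```python
from functools import lru_cache

# Toy vocabulary, assumed for illustration only.
VOCAB = {"a", "b", "c", "ab", "bc"}

# Hand-written character-level DFA for the pattern a b* c.
# Keys are (state, character); state 0 is initial, state 2 is accepting.
PATTERN_DFA = {(0, "a"): 1, (1, "b"): 1, (1, "c"): 2}


def all_tokenizations(text, vocab):
    """Enumerate every way to split `text` into tokens drawn from `vocab`."""

    @lru_cache(maxsize=None)
    def from_pos(i):
        if i == len(text):
            return ((),)  # one way to tokenize the empty suffix
        results = []
        for j in range(i + 1, len(text) + 1):
            piece = text[i:j]
            if piece in vocab:
                results.extend((piece,) + rest for rest in from_pos(j))
        return tuple(results)

    return [list(t) for t in from_pos(0)]


def step_token(dfa, state, token):
    """Feed a whole token through the character DFA; None means rejected."""
    for ch in token:
        state = dfa.get((state, ch))
        if state is None:
            return None
    return state


def allowed_tokens(dfa, state, vocab):
    """Tokens whose characters keep the pattern automaton alive from `state`."""
    return sorted(t for t in vocab if step_token(dfa, state, t) is not None)


print(all_tokenizations("abc", VOCAB))
# [['a', 'b', 'c'], ['a', 'bc'], ['ab', 'c']]
print(allowed_tokens(PATTERN_DFA, 0, VOCAB))  # ['a', 'ab']
print(allowed_tokens(PATTERN_DFA, 1, VOCAB))  # ['b', 'bc', 'c']
```

Note that this character-level filter accepts all three tokenizations of `abc`, although a given tokenizer produces only one of them for that string.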

Taxonomy

Topics: DNA and Biological Computing · Embedded Systems Design Techniques

Methods: Byte Pair Encoding