Byte BPE Tokenization as an Inverse string Homomorphism
Saibo Geng, Sankalp Gambhir, Chris Wendler, Robert West

TL;DR
This paper reveals that tokenization in large language models functions as an inverse homomorphism, preserving structural properties of the source language, and shows that proper tokenization does not limit neural network expressiveness.
Contribution
It introduces the concept of tokenization as an inverse string homomorphism and analyzes its implications for language structure and neural network expressiveness.
Findings
Tokenization acts as an inverse homomorphism between strings and tokens.
Proper tokenization is unambiguous and preserves language structure.
Neural architectures' ability to recognize context-free languages is unaffected by tokenization.
Abstract
Tokenization is an important preprocessing step in the training and inference of large language models (LLMs). While there has been extensive research on the expressive power of the neural architectures used in LLMs, the impact of tokenization has not been well understood. In this work, we demonstrate that tokenization, irrespective of the algorithm used, acts as an inverse homomorphism between strings and tokens. This suggests that the character space of the source language and the token space of the tokenized language are homomorphic, preserving the structural properties of the source language. Additionally, we explore the concept of proper tokenization, which refers to an unambiguous tokenization returned from the tokenizer. Our analysis reveals that the expressiveness of neural architectures in recognizing context-free languages is not affected by tokenization.
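The inverse-homomorphism view in the abstract can be illustrated with a small sketch. It is not the paper's construction: the four-token vocabulary `VOCAB`, the helper names `detokenize`, `preimage`, and `greedy_tokenize`, and the longest-match tokenizer (a stand-in, not BPE) are all assumptions made for illustration. Detokenization maps token sequences to strings by concatenation, so it respects concatenation (a string homomorphism). A string can have several token sequences in its preimage, and a tokenizer picks exactly one of them, which is the sense in which a tokenization is "proper".

```python
from typing import Dict, List, Tuple

# Hypothetical toy vocabulary (token id -> string piece), for illustration only.
VOCAB: Dict[int, str] = {0: "a", 1: "b", 2: "ab", 3: "ba"}


def detokenize(tokens: List[int]) -> str:
    """Map a token sequence to a string by concatenating its pieces."""
    return "".join(VOCAB[t] for t in tokens)


def preimage(text: str) -> List[Tuple[int, ...]]:
    """Enumerate every token sequence that detokenizes to `text`."""
    if text == "":
        return [()]
    results: List[Tuple[int, ...]] = []
    for tok, piece in VOCAB.items():
        if text.startswith(piece):
            # Extend each tokenization of the remaining suffix with this token.
            results.extend((tok,) + rest for rest in preimage(text[len(piece):]))
    return results


def greedy_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer (longest match first); an assumption, not BPE."""
    pieces = sorted(VOCAB.items(), key=lambda kv: -len(kv[1]))
    tokens: List[int] = []
    while text:
        for tok, piece in pieces:
            if text.startswith(piece):
                tokens.append(tok)
                text = text[len(piece):]
                break
        else:
            raise ValueError("text cannot be covered by the vocabulary")
    return tokens


if __name__ == "__main__":
    u, v = [0, 3], [2]
    # Homomorphism property: detokenizing a concatenation equals
    # concatenating the detokenizations.
    print(detokenize(u + v) == detokenize(u) + detokenize(v))
    # All token sequences that map to the same string.
    print(preimage("aba"))
    # The single sequence the stand-in tokenizer selects from that preimage.
    print(greedy_tokenize("aba"))
```

In this toy setting the string "aba" has more than one token sequence in its preimage, while the tokenizer deterministically returns one of them; detokenizing that output always recovers the original string.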
Taxonomy
Topics: Digital Rights Management and Security · Algorithms and Data Compression · DNA and Biological Computing
