From Language Models over Tokens to Language Models over Characters
Tim Vieira, Ben LeBrun, Mario Giulianelli, Juan Luis Gastaldi, Brian DuSell, John Terilla, Timothy J. O'Donnell, Ryan Cotterell

TL;DR
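This paper introduces exact and approximate algorithms for converting token-level language models into character-level ones, so that prompts can be handled directly as character strings. Benchmarks on four publicly available language models report accurate approximations at small computation budgets and an improved compression rate (bits/byte).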
Contribution
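The paper proposes both exact and approximate algorithms for transforming token-level language models into character-level models. This addresses the mismatch between prompts specified as character strings and models defined over tokens, which makes results sensitive to how a prompt is tokenized (e.g., whether it ends with a space).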
Findings
Accurately approximates character-level distributions with small computation budgets.
Achieves significant improvements in compression rate (bits/byte).
Demonstrates practical speed and approximation quality across four models.
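For readers unfamiliar with the compression metric named above, here is a minimal sketch of how bits/byte can be computed from per-character log-probabilities. The function name `bits_per_byte` and the choice of natural-log inputs are assumptions made for illustration; the paper's own evaluation code is not shown on this page.

```python
import math

def bits_per_byte(char_logprobs, text):
    """Compression rate of `text` under a character-level model.

    char_logprobs: natural-log probabilities, one per character of `text`
                   (an assumed input format, not taken from the paper).
    """
    # Total code length in bits: negative log-likelihood converted from nats.
    total_bits = -sum(char_logprobs) / math.log(2)
    # Normalize by the UTF-8 byte length, not the character count.
    n_bytes = len(text.encode("utf-8"))
    return total_bits / n_bytes

# A model assigning probabilities 0.5 and 0.25 to the two characters of "ab"
# spends 1 + 2 = 3 bits on 2 bytes.
rate = bits_per_byte([math.log(0.5), math.log(0.25)], "ab")
```

Lower values mean better compression. Dividing by bytes rather than tokens keeps the metric comparable across models with different tokenizers.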
Abstract
Modern language models are internally -- and mathematically -- distributions over token strings rather than character strings, posing numerous challenges for programmers building user applications on top of them. For example, if a prompt is specified as a character string, it must be tokenized before passing it to the token-level language model. Thus, the tokenizer and consequent processing are very sensitive to the specification of the prompt (e.g., whether the prompt ends with a space or not). This paper presents algorithms for converting token-level language models to character-level ones. We present both exact and approximate algorithms. In the empirical portion of the paper, we benchmark the practical runtime and approximation quality. Across four publicly available language models, we find that -- even with a small computation budget -- our method is able to…
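To illustrate what converting a token-level model into a character-level one means, the sketch below marginalizes a tiny, hypothetical token-level distribution by brute force. The names `TOKEN_LM`, `char_string_probs`, `prefix_prob`, and `next_char_dist` are invented for this example, and the enumeration only works because the toy model has finite support. It is not the paper's algorithm, which must handle real models where such enumeration is infeasible.

```python
from collections import defaultdict

# Hypothetical toy token-level model: probabilities over whole token strings.
# Note that ("a", "b") and ("ab",) spell the same character string "ab".
TOKEN_LM = {
    ("a",): 0.1,
    ("a", "b"): 0.3,
    ("ab",): 0.4,
    ("b",): 0.2,
}

def char_string_probs(token_lm):
    """Sum the probability of every tokenization that spells the same characters."""
    probs = defaultdict(float)
    for tokens, p in token_lm.items():
        probs["".join(tokens)] += p
    return dict(probs)

def prefix_prob(token_lm, prefix):
    """Probability that the generated character string starts with `prefix`."""
    return sum(p for s, p in char_string_probs(token_lm).items()
               if s.startswith(prefix))

def next_char_dist(token_lm, prefix):
    """Character-level conditional distribution over the next character."""
    denom = prefix_prob(token_lm, prefix)
    dist = defaultdict(float)
    for s, p in char_string_probs(token_lm).items():
        if s.startswith(prefix):
            # "<eos>" marks the case where the string ends exactly at the prefix.
            nxt = s[len(prefix)] if len(s) > len(prefix) else "<eos>"
            dist[nxt] += p / denom
    return dict(dist)

p_a = prefix_prob(TOKEN_LM, "a")
dist_after_a = next_char_dist(TOKEN_LM, "a")
```

Here the character prefix "a" collects mass from three token strings, including the single token "ab", which a tokenize-then-condition approach would miss if it committed to the token "a". This is the kind of prompt-boundary sensitivity the abstract describes.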
Taxonomy
Topics: Natural Language Processing Techniques
Methods: LLaMA
