TL;DR
This paper introduces an inference-time method that converts an autoregressive language model with a BPE tokenizer into a character-level or byte-level model. The conversion addresses the Prompt Boundary Problem and makes it possible to ensemble models, and to transfer between models, that use different tokenizers.
Contribution
The authors propose an inference-time technique that unifies the vocabularies of language models with different tokenizers and mitigates tokenization artifacts such as the Prompt Boundary Problem, which in turn facilitates ensembling and transfer learning.
Findings
Efficiently solves the Prompt Boundary Problem (PBP) at inference time.
Enables ensembling of models with different tokenizers.
Allows transfer learning between models with different tokenization schemes.
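To make the ensembling finding concrete: once two models both expose a next-byte distribution, their outputs live in the same space and can be combined directly. The sketch below is a hypothetical illustration, not the paper's combination rule. The function name `ensemble_next_byte`, the dict representation, and the weighted-average mixing are all assumptions made for this example.

```python
def ensemble_next_byte(dists, weights=None):
    """Mix several next-byte distributions by weighted averaging.

    Each distribution is a dict mapping a byte value (0-255) to a
    probability. Weighted averaging is an assumption for illustration;
    the source does not state how the authors combine models.
    """
    if weights is None:
        weights = [1.0 / len(dists)] * len(dists)
    mixed = {}
    for dist, w in zip(dists, weights):
        for byte, p in dist.items():
            mixed[byte] = mixed.get(byte, 0.0) + w * p
    # Renormalize in case the inputs or weights do not sum exactly to 1.
    total = sum(mixed.values())
    return {byte: p / total for byte, p in mixed.items()}


# Two hypothetical models whose tokenizers differ, already converted
# to byte-level next-byte distributions.
model_a = {ord("a"): 0.7, ord("b"): 0.3}
model_b = {ord("a"): 0.5, ord("c"): 0.5}
mixed = ensemble_next_byte([model_a, model_b])
```

With equal weights, `mixed` assigns 0.6 to `a`, 0.15 to `b`, and 0.25 to `c`. The same averaging would not be well defined on the original token-level outputs, because the two vocabularies do not share entries.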
Abstract
Tokenization is used almost universally by modern language models, enabling efficient text representation using multi-byte or multi-character tokens. However, prior work has shown that tokenization can introduce distortion into the model's generations, an issue known as the Prompt Boundary Problem (PBP). For example, users are often advised not to end their prompts with a space because it prevents the model from including the space as part of the next token. While this heuristic is effective in English, the underlying PBP continues to affect code generation and languages such as Chinese, where tokens often do not line up with word and syntactic boundaries. In this work, we present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM. Our method efficiently solves the PBP and is also able to unify the vocabularies of…
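The trailing-space example in the abstract can be reproduced with a toy tokenizer. The sketch below is a minimal illustration under stated assumptions: the vocabulary `VOCAB` is invented, and `tokenize` uses greedy longest-match as a stand-in for BPE rather than a real BPE merge procedure. It shows only that a prompt ending in a space can tokenize into a sequence that is not a prefix of the tokenization of the full text.

```python
# Hypothetical toy vocabulary. As in many BPE vocabularies, the leading
# space is assumed to be part of the word token (" world").
VOCAB = ["Hello", " world", " ", "world", "H", "e", "l", "o", "w", "r", "d"]


def tokenize(text, vocab=VOCAB):
    """Greedy longest-match tokenizer (a simplified stand-in for BPE)."""
    tokens = []
    i = 0
    while i < len(text):
        candidates = [t for t in vocab if text.startswith(t, i)]
        if not candidates:
            raise ValueError(f"no token matches at position {i}")
        match = max(candidates, key=len)
        tokens.append(match)
        i += len(match)
    return tokens


def is_token_prefix(prompt_tokens, full_tokens):
    """True if the prompt's tokens are a prefix of the full text's tokens."""
    return full_tokens[: len(prompt_tokens)] == prompt_tokens


full = tokenize("Hello world")        # ['Hello', ' world']
with_space = tokenize("Hello ")       # ['Hello', ' ']
without_space = tokenize("Hello")     # ['Hello']

print(is_token_prefix(with_space, full))     # False
print(is_token_prefix(without_space, full))  # True
```

In this toy setting, the prompt ending in a space commits the model to the standalone `' '` token, so the continuation `' world'` is no longer reachable as a single token. A character-level or byte-level view of the model avoids the mismatch because the prompt and the continuation are compared as raw text rather than as token sequences.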
