Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations
Brian Siyuan Zheng, Alisa Liu, Orevaoghene Ahia, Jonathan Hayase, Yejin Choi, Noah A. Smith

TL;DR
This paper shows that instruction-tuned language models are surprisingly robust to non-canonical tokenizations never seen during training: they retain most of their original performance across 20 benchmarks, and certain non-canonical schemes even improve results on some tasks. The robustness appears to arise during instruction-tuning.
Contribution
The paper demonstrates that instruction-tuned language models are robust to non-canonical tokenizations unseen during training, and explores how non-canonical tokenization schemes can improve performance on specific tasks.
Findings
Instruction-tuned models retain up to 93.4% of their original performance when given a randomly sampled tokenization, and 90.8% with character-level tokenization.
Character-level tokenization improves string and code tasks by up to 14%.
Non-canonical tokenization robustness arises during instruction-tuning.
Abstract
Modern tokenizers employ deterministic algorithms to map text into a single "canonical" token sequence, yet the same string can be encoded as many non-canonical tokenizations using the tokenizer vocabulary. In this work, we investigate the robustness of LMs to text encoded with non-canonical tokenizations entirely unseen during training. Surprisingly, when evaluated across 20 benchmarks, we find that instruction-tuned models retain up to 93.4% of their original performance when given a randomly sampled tokenization, and 90.8% with character-level tokenization. We see that overall stronger models tend to be more robust, and robustness diminishes as the tokenization departs farther from the canonical form. Motivated by these results, we then identify settings where non-canonical tokenization schemes can *improve* performance, finding that character-level segmentation improves string…
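To make the canonical versus non-canonical distinction concrete, here is a minimal sketch in Python. It is not the paper's code. The vocabulary `VOCAB` is a hypothetical toy, and greedy longest-match stands in for a real deterministic tokenizer (production tokenizers such as BPE use different algorithms). The sketch enumerates every segmentation of a string into vocabulary pieces, then picks out the stand-in canonical one, a uniformly random one, and the character-level one.

```python
import random

# Hypothetical toy vocabulary (an assumption for illustration only).
# Every single character of the example string is included so that a
# character-level tokenization always exists.
VOCAB = {"t", "o", "k", "e", "n", "s", "to", "ken", "en", "ns", "token", "tokens"}


def all_tokenizations(text, vocab):
    """Return every way to split `text` into pieces that all belong to `vocab`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            for rest in all_tokenizations(text[end:], vocab):
                results.append([piece] + rest)
    return results


def greedy_longest_match(text, vocab):
    """Stand-in for a deterministic 'canonical' tokenizer (not BPE)."""
    tokens, i = [], 0
    while i < len(text):
        for end in range(len(text), i, -1):
            if text[i:end] in vocab:
                tokens.append(text[i:end])
                i = end
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return tokens


def character_level(text):
    """Character-level tokenization: one token per character."""
    return list(text)


def random_tokenization(text, vocab, rng):
    """Sample uniformly among all valid segmentations (feasible for short strings only)."""
    return rng.choice(all_tokenizations(text, vocab))


if __name__ == "__main__":
    text = "tokens"
    rng = random.Random(0)
    print("canonical (stand-in):", greedy_longest_match(text, VOCAB))
    print("character-level:     ", character_level(text))
    print("random sample:       ", random_tokenization(text, VOCAB, rng))
    print("number of segmentations:", len(all_tokenizations(text, VOCAB)))
```

Every segmentation decodes to the same string, which is why a model can in principle be fed any of them. The exhaustive enumeration grows quickly with string length, so this approach is only suitable for short illustrative inputs.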
Taxonomy
Topics: Advanced Malware Detection Techniques
