Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations

Brian Siyuan Zheng; Alisa Liu; Orevaoghene Ahia; Jonathan Hayase; Yejin Choi; Noah A. Smith

arXiv:2506.19004 · cs.CL · February 4, 2026

TL;DR

This paper shows that language models are surprisingly robust to non-canonical tokenizations never seen during training: they retain most of their performance across benchmarks and, in some settings, benefit from non-standard tokenization schemes. The robustness appears to arise during instruction-tuning.

Contribution

The paper demonstrates that instruction-tuned language models are robust to non-canonical tokenizations unseen during training, and explores how the choice of tokenization can be adjusted to improve performance on specific tasks.

Findings

01  Instruction-tuned models retain up to 93.4% of their original performance when given a randomly sampled tokenization.

02  Character-level tokenization improves performance on string and code tasks by up to +14%.

03  Robustness to non-canonical tokenizations arises during instruction-tuning.

Abstract

Modern tokenizers employ deterministic algorithms to map text into a single "canonical" token sequence, yet the same string can be encoded as many non-canonical tokenizations using the tokenizer vocabulary. In this work, we investigate the robustness of LMs to text encoded with non-canonical tokenizations entirely unseen during training. Surprisingly, when evaluated across 20 benchmarks, we find that instruction-tuned models retain up to 93.4% of their original performance when given a randomly sampled tokenization, and 90.8% with character-level tokenization. We see that overall stronger models tend to be more robust, and robustness diminishes as the tokenization departs farther from the canonical form. Motivated by these results, we then identify settings where non-canonical tokenization schemes can *improve* performance, finding that character-level segmentation improves string…
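To make the distinction between canonical and non-canonical tokenizations concrete, here is a minimal sketch in Python. It is not the paper's code: the toy vocabulary `VOCAB` and all function names are hypothetical, and greedy longest-match is used only as a simple stand-in for a deterministic tokenizer (real BPE tokenizers apply learned merge rules instead). The sketch enumerates every way a string can be split into vocabulary pieces, then shows the three cases the abstract mentions: a deterministic segmentation, a randomly sampled one, and a character-level one.

```python
import random

# Hypothetical toy vocabulary; real tokenizers have tens of thousands of entries.
VOCAB = {"t", "o", "k", "e", "n", "s", "to", "en", "ns", "ken", "token", "tokens"}


def all_segmentations(text, vocab):
    """Return every way to split `text` into pieces that all belong to `vocab`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            for rest in all_segmentations(text[end:], vocab):
                results.append([piece] + rest)
    return results


def greedy_longest_match(text, vocab):
    """Stand-in for a deterministic tokenizer (assumption: not real BPE).

    At each position, take the longest vocabulary piece that matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for end in range(len(text), i, -1):
            if text[i:end] in vocab:
                tokens.append(text[i:end])
                i = end
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return tokens


def character_level(text):
    """Split into single characters (assumes each one is in the vocabulary)."""
    return list(text)


def random_segmentation(text, vocab, rng):
    """Sample one segmentation uniformly from all valid ones."""
    return rng.choice(all_segmentations(text, vocab))


if __name__ == "__main__":
    text = "tokens"
    print("deterministic:", greedy_longest_match(text, VOCAB))
    print("random:       ", random_segmentation(text, VOCAB, random.Random(0)))
    print("characters:   ", character_level(text))
    print("total segmentations:", len(all_segmentations(text, VOCAB)))
```

Every segmentation decodes to the same string, so the text is unchanged and only the token sequence the model receives differs. Exhaustive enumeration grows quickly with string length and is practical only for short toy inputs like this one.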
