StochasTok: Improving Fine-Grained Subword Understanding in LLMs

Anya Sims; Thom Foster; Klara Kaleb; Tuan-Duy H. Nguyen; Joseph Lee; Jakob N. Foerster; Yee Whye Teh; Cong Lu

arXiv:2506.01687·cs.CL·April 22, 2026

TL;DR

StochasTok is a stochastic tokenization method that enhances large language models' understanding of subword structures by randomly splitting tokens during training, improving performance on subword-level tasks.

Contribution

The paper introduces StochasTok, a simple and efficient stochastic tokenization scheme that can be integrated during or after training to improve subword understanding in LLMs.

Findings

01. Pretraining with StochasTok improves performance on downstream subword-level tasks.

02. Post-training with StochasTok improves the subword understanding of existing pretrained models.

03. The method is simple, efficient, and can be applied at any stage of training.

Abstract

Subword-level understanding is integral to numerous tasks, including understanding multi-digit numbers, spelling mistakes, abbreviations, rhyming, and wordplay. Despite this, current large language models (LLMs) still struggle disproportionally with simple subword-level tasks like 'How many r's in strawberry?'. A key factor behind these failures is tokenization, which obscures the fine-grained structure of words. Current alternatives, such as character-level and dropout tokenization methods, significantly increase computational costs and provide inconsistent improvements. In this paper we revisit tokenization and introduce StochasTok, a simple, efficient stochastic tokenization scheme that randomly splits tokens during training, allowing LLMs to 'see' their internal structure. Our experiments show that pretraining with StochasTok substantially improves LLMs' downstream performance…
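The abstract describes the core idea only as randomly splitting tokens during training. The following is a minimal sketch of what such a step could look like, not the authors' implementation: the toy vocabulary, the helper names `build_splits` and `stochastic_split`, the `num_splits` parameter, and the rule that a token may only be split into two pieces that are both already in the vocabulary are all assumptions made for illustration.

```python
import random

def build_splits(vocab):
    """Map each token to every (left, right) pair of in-vocabulary pieces
    whose concatenation equals the token. (Hypothetical helper.)"""
    splits = {}
    for tok in vocab:
        pairs = [(tok[:i], tok[i:]) for i in range(1, len(tok))
                 if tok[:i] in vocab and tok[i:] in vocab]
        if pairs:
            splits[tok] = pairs
    return splits

def stochastic_split(tokens, splits, num_splits, rng):
    """Apply up to `num_splits` random splits to a token sequence.
    Each step picks one splittable token uniformly at random and
    replaces it with one of its equivalent two-piece splits."""
    tokens = list(tokens)
    for _ in range(num_splits):
        candidates = [i for i, t in enumerate(tokens) if t in splits]
        if not candidates:
            break  # nothing left that can be split
        i = rng.choice(candidates)
        left, right = rng.choice(splits[tokens[i]])
        tokens[i:i + 1] = [left, right]
    return tokens

# Toy vocabulary (assumed for illustration, not taken from the paper).
vocab = {"straw", "berry", "st", "raw", "ber", "ry",
         "s", "t", "r", "a", "w", "b", "e", "y"}
splits = build_splits(vocab)

rng = random.Random(0)
original = ["straw", "berry"]
augmented = stochastic_split(original, splits, num_splits=3, rng=rng)
print(augmented)
```

Because every split replaces a token with pieces that concatenate back to it, the augmented sequence always decodes to the same text as the original. In this sketch the model would see several different tokenizations of the same word across training, which is one way to expose the internal structure that a fixed tokenization hides.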

Code & Models

Repositories

anyasims/stochastok (GitHub)

Videos

StochasTok: Improving Fine-Grained Subword Understanding in LLMs (SlidesLive)