TL;DR
StochasTok is a stochastic tokenization method that randomly splits tokens during training, exposing the internal structure of words to large language models and improving their performance on subword-level tasks.
Contribution
The paper introduces StochasTok, a simple and efficient stochastic tokenization scheme that can be applied during pretraining or in post-training of an existing model to improve subword understanding in LLMs.
Findings
Pretraining with StochasTok improves downstream subword tasks.
Post-training with StochasTok enhances existing models' subword understanding.
The method is simple and efficient, and it can be applied at any training stage.
Abstract
Subword-level understanding is integral to numerous tasks, including understanding multi-digit numbers, spelling mistakes, abbreviations, rhyming, and wordplay. Despite this, current large language models (LLMs) still struggle disproportionally with simple subword-level tasks like 'How many r's in strawberry?'. A key factor behind these failures is tokenization, which obscures the fine-grained structure of words. Current alternatives, such as character-level and dropout tokenization methods, significantly increase computational costs and provide inconsistent improvements. In this paper we revisit tokenization and introduce StochasTok, a simple, efficient stochastic tokenization scheme that randomly splits tokens during training, allowing LLMs to 'see' their internal structure. Our experiments show that pretraining with StochasTok substantially improves LLMs' downstream performance…
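The core idea in the abstract, randomly splitting tokens during training, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `stochastic_split`, the parameter `p`, the toy vocabulary, the use of strings in place of token ids, and the rule that a token is only split into two pieces that are themselves in the vocabulary are all assumptions made for illustration.

```python
import random

def stochastic_split(tokens, vocab, p=0.5, rng=None):
    """Randomly replace some tokens with two shorter in-vocabulary tokens.

    Hypothetical sketch: tokens are plain strings, and a token is split
    only when both halves already exist in `vocab` (an assumption).
    """
    rng = rng if rng is not None else random.Random()
    out = []
    for tok in tokens:
        # Every way to cut the token into two pieces that are both in the vocabulary.
        candidates = [
            (tok[:i], tok[i:])
            for i in range(1, len(tok))
            if tok[:i] in vocab and tok[i:] in vocab
        ]
        if candidates and rng.random() < p:
            out.extend(rng.choice(candidates))  # split: emit the two pieces
        else:
            out.append(tok)  # keep the original token
    return out

# Toy vocabulary (assumed for the example, not taken from the paper).
toy_vocab = {"straw", "berry", "st", "raw", "b", "erry", "ber", "ry"}
print(stochastic_split(["straw", "berry"], toy_vocab, p=0.5))
```

Because every split preserves the underlying text, the model sees the same string under several tokenizations across training steps, which is how, under these assumptions, it could learn how larger tokens decompose into smaller ones.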
