The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models
Adrian Cosma, Stefan Ruseti, Emilian Radoi, Mihai Dascalu

TL;DR
This paper investigates why large language models struggle with character-level tasks such as counting letters in words, traces the difficulty to tokenization, analyzes how character-level reasoning emerges during training, and proposes a lightweight architectural modification to improve it.
Contribution
It introduces a controlled experimental framework to study character reasoning emergence and proposes a lightweight model modification to improve character-level understanding in tokenized LLMs.
Findings
Character reasoning emerges late in training
Percolation models explain concept emergence patterns
Architectural tweaks improve character reasoning
Abstract
Despite their remarkable progress across diverse domains, Large Language Models (LLMs) consistently fail at simple character-level tasks, such as counting letters in words, due to a fundamental limitation: tokenization. In this work, we frame this limitation as a problem of low mutual information and analyze it in terms of concept emergence. Using a suite of 19 synthetic tasks that isolate character-level reasoning in a controlled setting, we show that such capabilities emerge suddenly and only late in training. We find that percolation-based models of concept emergence explain these patterns, suggesting that learning character composition is not fundamentally different from learning commonsense knowledge. To address this bottleneck, we propose a lightweight architectural modification that significantly improves character-level reasoning while preserving the inductive advantages of…
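To make the letter-counting difficulty concrete, here is a minimal sketch of why tokenization hides characters from a model. It is an illustration only, not the paper's experimental setup: the vocabulary `TOY_VOCAB`, the greedy longest-match segmenter `greedy_tokenize`, and the helper `count_letter` are all hypothetical names introduced here, and real subword tokenizers learn their vocabularies from data.

```python
from collections import Counter

# Hypothetical toy vocabulary (an assumption for illustration only).
TOY_VOCAB = {"straw", "berry", "st", "raw", "b", "e", "r", "y", "s", "t", "a", "w"}


def greedy_tokenize(word, vocab=TOY_VOCAB):
    """Longest-match-first segmentation, a simplified stand-in for subword tokenization."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate substring first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {word!r} at position {i}")
    return tokens


def count_letter(word, letter):
    """Ground-truth character count, trivial when characters are visible."""
    return Counter(word)[letter]


word = "strawberry"
tokens = greedy_tokenize(word)

# The model receives opaque integer ids, not the characters inside each token.
vocab_index = {tok: idx for idx, tok in enumerate(sorted(TOY_VOCAB))}
token_ids = [vocab_index[t] for t in tokens]

print("tokens:   ", tokens)
print("token ids:", token_ids)
print("count of 'r':", count_letter(word, "r"))
```

With character access the count is a one-line lookup, whereas a model that sees only the two token ids has to learn from data which characters each token contains.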
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
