The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models

Adrian Cosma; Stefan Ruseti; Emilian Radoi; Mihai Dascalu

arXiv:2505.14172·cs.CL·September 17, 2025

TL;DR

This paper investigates why tokenization causes large language models to struggle with character-level tasks, analyzes how character-level reasoning emerges during training, and proposes an architectural modification that improves it.

Contribution

The paper introduces a controlled experimental framework of 19 synthetic tasks for studying how character-level reasoning emerges, and proposes a lightweight architectural modification that improves character-level understanding in tokenized LLMs.

Findings

01 Character reasoning emerges late in training

02 Percolation models explain concept emergence patterns

03 Architectural tweaks improve character reasoning

Abstract

Despite their remarkable progress across diverse domains, Large Language Models (LLMs) consistently fail at simple character-level tasks, such as counting letters in words, due to a fundamental limitation: tokenization. In this work, we frame this limitation as a problem of low mutual information and analyze it in terms of concept emergence. Using a suite of 19 synthetic tasks that isolate character-level reasoning in a controlled setting, we show that such capabilities emerge suddenly and only late in training. We find that percolation-based models of concept emergence explain these patterns, suggesting that learning character composition is not fundamentally different from learning commonsense knowledge. To address this bottleneck, we propose a lightweight architectural modification that significantly improves character-level reasoning while preserving the inductive advantages of…
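To make the letter-counting failure mode concrete, the following is a minimal sketch of how a character-level query looks once a word has been tokenized. It is not the authors' task suite or tokenizer: the vocabulary `VOCAB`, the greedy `tokenize` function, and the `make_example` helper are all hypothetical names chosen for illustration. The point is only that the model receives opaque token ids, while the target depends on the spelling hidden inside each token.

```python
from typing import Dict, List

# Hypothetical subword vocabulary, chosen only for illustration.
VOCAB: Dict[str, int] = {
    "straw": 0, "berry": 1,
    "s": 2, "t": 3, "r": 4, "a": 5, "w": 6, "b": 7, "e": 8, "y": 9,
}


def tokenize(word: str, vocab: Dict[str, int]) -> List[int]:
    """Greedy longest-match segmentation of a word into vocabulary ids."""
    ids: List[int] = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {word[i:]!r}")
    return ids


def make_example(word: str, letter: str, vocab: Dict[str, int]) -> dict:
    """Build one counting example: token ids in, character count out."""
    return {
        "input_ids": tokenize(word, vocab),  # what the model sees
        "query": letter,
        "target": word.count(letter),        # computed from the raw characters
    }


example = make_example("strawberry", "r", VOCAB)
print(example)
```

Under this toy vocabulary, "strawberry" is reduced to two ids, so answering the query requires the model to have learned which characters each token is composed of.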

Code & Models

Repositories

cosmaadrian/strawberry-problem (PyTorch, Official)

Videos

The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models · underline

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling