CharBench: Evaluating the Role of Tokenization in Character-Level Tasks
Omri Uzan, Yuval Pinter

TL;DR
CharBench is a large benchmark designed to evaluate how tokenization affects character-level reasoning tasks in language models, revealing that current models struggle significantly with these tasks.
Contribution
The paper introduces CharBench, a comprehensive benchmark of character-level tasks that is two orders of magnitude larger than existing alternatives, and provides a detailed analysis of how tokenization affects model performance.
Findings
Modern LLMs reach an average accuracy of 43.6% on CharBench, and 32.3% on some tasks.
Tokenization properties are weakly correlated with counting task accuracy.
Longer tokens obscure character position information, negatively affecting intra-word positional tasks.
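To make the two task families concrete, here is a minimal Python sketch of character counting and character locating, plus a helper that maps a character position to the token containing it. The function names and the segmentation of "strawberry" are hypothetical and chosen only for illustration. They are not taken from CharBench or from any real tokenizer.

```python
from typing import List, Tuple


def count_char(word: str, ch: str) -> int:
    """Counting task: how many times does `ch` occur in `word`?"""
    return word.count(ch)


def first_index(word: str, ch: str) -> int:
    """Positional task: 0-based index of the first `ch` in `word` (-1 if absent)."""
    return word.find(ch)


def token_of_position(tokens: List[str], index: int) -> Tuple[int, int]:
    """Map a character index in the word to (token index, offset inside that token)."""
    start = 0
    for i, tok in enumerate(tokens):
        if index < start + len(tok):
            return i, index - start
        start += len(tok)
    raise IndexError("character index lies outside the word")


word = "strawberry"
# Hypothetical segmentation (an assumption, not the output of a real tokenizer).
tokens = ["str", "aw", "berry"]

n_r = count_char(word, "r")
pos_b = first_index(word, "b")
tok_idx, offset = token_of_position(tokens, pos_b)
mean_token_len = len(word) / len(tokens)
```

With longer tokens, more characters share a single token, so a character's position inside the word is not directly visible at the token level. This is the kind of property the positional finding above refers to.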
Abstract
Tasks that require character-level reasoning, such as counting or locating characters within words, remain challenging for contemporary language models. A common conjecture is that language models' reliance on subword units, rather than characters, contributes to their struggles with character-level tasks, yet recent studies offer conflicting conclusions about the role of tokenization, leaving its impact unclear. To address this gap, we introduce CharBench, a comprehensive benchmark of character-level tasks that is two orders of magnitude larger than existing alternatives. We evaluate a diverse range of leading open-weight and proprietary models on CharBench and find that it presents a significant challenge to modern LLMs, with an average accuracy of 43.6% and 32.3% on some tasks. We present an in-depth analysis of how intrinsic properties of words and their segmentations into tokens…
