CharBench: Evaluating the Role of Tokenization in Character-Level Tasks

Omri Uzan; Yuval Pinter

arXiv:2508.02591·cs.CL·April 8, 2026

TL;DR

CharBench is a large benchmark designed to evaluate how tokenization affects character-level reasoning tasks in language models, revealing that current models struggle significantly with these tasks.

Contribution

The paper introduces CharBench, a comprehensive benchmark of character-level tasks that is two orders of magnitude larger than existing alternatives, and analyzes in detail how tokenization affects model performance on these tasks.

Findings

01 Modern LLMs average 43.6% accuracy on CharBench, with accuracy as low as 32.3% on some tasks.

02 Tokenization properties are only weakly correlated with accuracy on counting tasks.

03 Longer tokens obscure character position information, which hurts performance on intra-word positional tasks.
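
To make the two task families and the token-length property concrete, here is a minimal Python sketch. It is not the authors' evaluation code: the function names, the example word, and the segmentation `["str", "aw", "berry"]` are all hypothetical, chosen only for illustration. It computes the ground truth for a counting query and a positional query, then finds which token of an assumed segmentation contains the queried character.

```python
def count_char(word: str, ch: str) -> int:
    """Ground truth for a counting task: occurrences of ch in word."""
    return word.count(ch)


def first_index(word: str, ch: str) -> int:
    """Ground truth for a positional task: 0-based index of the first ch, or -1."""
    return word.find(ch)


def containing_token(tokens: list[str], char_index: int) -> tuple[int, str]:
    """Return (token position, token text) for the token covering char_index."""
    offset = 0
    for i, tok in enumerate(tokens):
        if char_index < offset + len(tok):
            return i, tok
        offset += len(tok)
    raise IndexError("char_index lies outside the segmented word")


word = "strawberry"
tokens = ["str", "aw", "berry"]  # hypothetical segmentation, not from a real tokenizer

r_count = count_char(word, "r")
b_index = first_index(word, "b")
tok_pos, tok_text = containing_token(tokens, b_index)

print(f"count of 'r': {r_count}")
print(f"first index of 'b': {b_index}")
print(f"token containing 'b': {tok_text!r} (length {len(tok_text)})")
```

Under this assumed segmentation, the queried character falls inside a five-character token, which is the kind of case where finding 03 says position information is harder for a model to recover.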

Abstract

Tasks that require character-level reasoning, such as counting or locating characters within words, remain challenging for contemporary language models. A common conjecture is that language models' reliance on subword units, rather than characters, contributes to their struggles with character-level tasks, yet recent studies offer conflicting conclusions about the role of tokenization, leaving its impact unclear. To address this gap, we introduce CharBench, a comprehensive benchmark of character-level tasks that is two orders of magnitude larger than existing alternatives. We evaluate a diverse range of leading open-weight and proprietary models on CharBench and find that it presents a significant challenge to modern LLMs, with an average accuracy of 43.6% and 32.3% on some tasks. We present an in-depth analysis of how intrinsic properties of words and their segmentations into tokens…

Code & Models

Datasets

omriuz/CharBench
dataset · 63 downloads

Videos

CharBench: Evaluating the Role of Tokenization in Character-Level Tasks