Spelling-out is not Straightforward: LLMs' Capability of Tokenization from Token to Characters

Tatsuya Hiraoka; Kentaro Inui

arXiv:2506.10641 · cs.CL · June 13, 2025

Open Access

TL;DR

This paper investigates how large language models process character-level information when spelling out tokens. It finds that the embedding layer does not fully encode this information, so the models rely on intermediate and higher Transformer layers to reconstruct it.

Contribution

The study examines how LLMs internally represent and use character-level information during spelling, showing that this information is reconstructed in intermediate and higher Transformer layers rather than read directly from the embedding layer.

Findings

01 The embedding layer does not fully encode character-level information, particularly beyond the first character.

02 Intermediate and higher Transformer layers reconstruct character-level knowledge.

03 Spelling behavior shows a distinct layer-wise "breakthrough".

Abstract

Large language models (LLMs) can spell out tokens character by character with high accuracy, yet they struggle with more complex character-level tasks, such as identifying compositional subcomponents within tokens. In this work, we investigate how LLMs internally represent and utilize character-level information during the spelling-out process. Our analysis reveals that, although spelling out is a simple task for humans, it is not handled in a straightforward manner by LLMs. Specifically, we show that the embedding layer does not fully encode character-level information, particularly beyond the first character. As a result, LLMs rely on intermediate and higher Transformer layers to reconstruct character-level knowledge, where we observe a distinct "breakthrough" in their spelling behavior. We validate this mechanism through three complementary analyses: probing classifiers,…
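The idea behind a probing classifier, one of the analyses named in the abstract, can be illustrated with a toy sketch. This is not the authors' setup: the embeddings below are synthetic and deliberately built to encode only the first character of each token, and the nearest-centroid probe, `ALPHABET`, `DIM`, and every function name are assumptions made for illustration. A probe is trained to predict the character at a given position from a token's vector; high held-out accuracy suggests the vector encodes that character, while near-chance accuracy suggests it does not.

```python
import itertools
import random

ALPHABET = "abcde"  # hypothetical toy alphabet
DIM = 8             # hypothetical embedding size


def make_embedding(token, rng):
    """Synthetic vector that encodes ONLY the first character, plus noise."""
    vec = [rng.gauss(0.0, 0.1) for _ in range(DIM)]
    vec[ALPHABET.index(token[0])] += 1.0
    return vec


def fit_centroids(vectors, labels):
    """Nearest-centroid probe: average the training vectors of each class."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        total = sums.setdefault(label, [0.0] * DIM)
        for i, x in enumerate(vec):
            total[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {lab: [x / counts[lab] for x in s] for lab, s in sums.items()}


def predict(centroids, vec):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))
    return min(centroids, key=lambda lab: dist(centroids[lab]))


def probe_accuracy(position, seed=0):
    """Train a probe for the character at `position`; return held-out accuracy."""
    rng = random.Random(seed)
    tokens = ["".join(t) for t in itertools.product(ALPHABET, repeat=3)]
    rng.shuffle(tokens)
    vectors = [make_embedding(t, rng) for t in tokens]
    labels = [t[position] for t in tokens]
    split = 100  # 100 training tokens, 25 held-out tokens
    centroids = fit_centroids(vectors[:split], labels[:split])
    hits = sum(
        predict(centroids, v) == y
        for v, y in zip(vectors[split:], labels[split:])
    )
    return hits / (len(tokens) - split)


acc_first = probe_accuracy(position=0)
acc_second = probe_accuracy(position=1)
print(f"first-character probe accuracy:  {acc_first:.2f}")
print(f"second-character probe accuracy: {acc_second:.2f}")
```

On this synthetic data the first-character probe should score close to 1.0, while the second-character probe should stay around chance (0.2 for five letters), because the vectors carry no signal about later characters. A real analysis would replace `make_embedding` with hidden states taken from each layer of an actual model and compare probe accuracy across layers.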

Taxonomy

Topics: Language and cultural evolution · Text Readability and Simplification · Topic Modeling

Methods: Absolute Position Encodings · Layer Normalization · Byte Pair Encoding · Label Smoothing · Softmax · Dropout · Dense Connections · Transformer · Attention Is All You Need