Spelling-out is not Straightforward: LLMs' Capability of Tokenization from Token to Characters
Tatsuya Hiraoka, Kentaro Inui

TL;DR
This paper investigates how large language models (LLMs) use character-level information when spelling out tokens. It finds that the embedding layer does not fully encode this information, so the models rely on intermediate and higher Transformer layers to reconstruct it.
Contribution
The study characterizes the internal mechanism behind LLMs' spelling behavior, showing that character-level information is reconstructed across Transformer layers rather than read directly from the token embeddings.
Findings
The embedding layer does not fully encode character-level information, particularly beyond the first character.
Intermediate and higher Transformer layers reconstruct character-level knowledge.
Spelling behavior shows a distinct layer-wise breakthrough.
Abstract
Large language models (LLMs) can spell out tokens character by character with high accuracy, yet they struggle with more complex character-level tasks, such as identifying compositional subcomponents within tokens. In this work, we investigate how LLMs internally represent and utilize character-level information during the spelling-out process. Our analysis reveals that, although spelling out is a simple task for humans, it is not handled in a straightforward manner by LLMs. Specifically, we show that the embedding layer does not fully encode character-level information, particularly beyond the first character. As a result, LLMs rely on intermediate and higher Transformer layers to reconstruct character-level knowledge, where we observe a distinct "breakthrough" in their spelling behavior. We validate this mechanism through three complementary analyses: probing classifiers,…
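The probing-classifier analysis mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' setup: it uses synthetic vectors in place of real LLM embeddings, and every name in it (make_toy_data, train_probe, accuracy, and the constants) is hypothetical. The data is constructed under the assumption that the first character is linearly encoded in each vector and the second is not, which mimics the reported finding; the sketch only shows how a linear probe would expose such a difference.

```python
import math
import random


def make_toy_data(n_tokens, n_chars, dim, rng):
    """Build synthetic 'token embeddings' (an assumption, not real LLM data).

    Each vector is a noisy copy of a direction tied to the token's FIRST
    character. The SECOND character is sampled independently and leaves
    no trace in the vector.
    """
    char_dirs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_chars)]
    vecs, first, second = [], [], []
    for _ in range(n_tokens):
        c1, c2 = rng.randrange(n_chars), rng.randrange(n_chars)
        vecs.append([char_dirs[c1][j] + rng.gauss(0.0, 0.3) for j in range(dim)])
        first.append(c1)
        second.append(c2)
    return vecs, first, second


def logits_for(w, b, x):
    """Linear scores, one per character class."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bk for row, bk in zip(w, b)]


def train_probe(xs, ys, n_classes, epochs=20, lr=0.05):
    """Train a linear softmax probe with plain SGD on cross-entropy loss."""
    dim = len(xs[0])
    w = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            scores = logits_for(w, b, x)
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
            total = sum(exps)
            for k in range(n_classes):
                grad = exps[k] / total - (1.0 if k == y else 0.0)
                b[k] -= lr * grad
                for j in range(dim):
                    w[k][j] -= lr * grad * x[j]
    return w, b


def accuracy(w, b, xs, ys):
    """Fraction of examples whose highest-scoring class matches the label."""
    hits = 0
    for x, y in zip(xs, ys):
        scores = logits_for(w, b, x)
        hits += scores.index(max(scores)) == y
    return hits / len(xs)


rng = random.Random(0)
N_CHARS, DIM = 5, 16
xs, first, second = make_toy_data(600, N_CHARS, DIM, rng)
train, test = slice(0, 400), slice(400, 600)

# One probe per character position, both reading the same vectors.
w1, b1 = train_probe(xs[train], first[train], N_CHARS)
w2, b2 = train_probe(xs[train], second[train], N_CHARS)
acc_first = accuracy(w1, b1, xs[test], first[test])
acc_second = accuracy(w2, b2, xs[test], second[test])
print(f"first-character probe accuracy:  {acc_first:.2f}")
print(f"second-character probe accuracy: {acc_second:.2f}")
```

In this toy setting the first-character probe should score far above the second-character probe, which should stay near chance (1 in 5 here). With a real model, the same probe would be trained on the embedding-layer output and on each Transformer layer's hidden states to see at which depth later characters become decodable.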
Taxonomy
Topics: Language and cultural evolution · Text Readability and Simplification · Topic Modeling
Methods: Absolute Position Encodings · Layer Normalization · Byte Pair Encoding · Label Smoothing · Softmax · Dropout · Dense Connections · Transformer · Attention Is All You Need
