Models In a Spelling Bee: Language Models Implicitly Learn the Character Composition of Tokens
Itay Itzhak, Omer Levy

TL;DR
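Pretrained language models implicitly learn the character composition of their tokens: their embeddings hold enough information to spell many tokens, even though the models never see characters paired with tokens. Enriching a subword model with explicit spelling information yields a near-identical learning curve, so it does not noticeably improve language modeling.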
Contribution
This study reveals that language models inherently acquire character-level knowledge of tokens, challenging assumptions about the need for explicit spelling training.
Findings
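The embedding layer of RoBERTa holds enough information to accurately spell up to a third of the vocabulary.
Predicted spellings reach high average character n-gram overlap with the true spellings on all token types.
Enriching a subword model with explicit spelling information gives a near-identical learning curve to training without it.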
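The two measures named in the findings, exact spelling accuracy and character n-gram overlap, can be illustrated with a small sketch. This is a simplified n-gram F-score written for illustration; the function names, the choice of n-gram orders 1 to 3, and the toy predictions are assumptions, and the paper's exact metric may differ.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count the character n-grams of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_f1(predicted, gold, n):
    """F1 over character n-grams of a single order n, using clipped counts."""
    pred, ref = char_ngrams(predicted, n), char_ngrams(gold, n)
    if not pred or not ref:
        return 0.0  # one of the strings is shorter than n characters
    overlap = sum((pred & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def spelling_scores(pairs, max_n=3):
    """Return (exact-match rate, n-gram F1 averaged over orders 1..max_n)."""
    exact = sum(p == g for p, g in pairs) / len(pairs)
    overlap = sum(
        sum(ngram_f1(p, g, n) for n in range(1, max_n + 1)) / max_n
        for p, g in pairs
    ) / len(pairs)
    return exact, overlap


# Hypothetical (predicted spelling, true spelling) pairs, not results from the paper.
pairs = [("cat", "cat"), ("hose", "house"), ("xyz", "dog")]
exact, overlap = spelling_scores(pairs)
print(f"exact match: {exact:.2f}, mean n-gram F1: {overlap:.2f}")
```

Exact match only credits fully correct spellings, while the n-gram score gives partial credit to near misses such as "hose" for "house", which is why the two findings are reported separately.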
Abstract
Standard pretrained language models operate on sequences of subword tokens without direct access to the characters that compose each token's string representation. We probe the embedding layer of pretrained language models and show that models learn the internal character composition of whole word and subword tokens to a surprising extent, without ever seeing the characters coupled with the tokens. Our results show that the embedding layer of RoBERTa holds enough information to accurately spell up to a third of the vocabulary and reach high average character ngram overlap on all token types. We further test whether enriching subword models with additional character information can improve language modeling, and observe that this method has a near-identical learning curve as training without spelling-based enrichment. Overall, our results suggest that language modeling objectives…
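The probing setup described in the abstract can be pictured as pairing each token's embedding vector with its character sequence and holding out part of the vocabulary for evaluation. The sketch below is only an illustration under stated assumptions: the toy embedding table, the 80/20 split, the helper names, and the stripping of RoBERTa's leading-space marker are hypothetical choices, not the authors' confirmed procedure.

```python
import random

# RoBERTa's byte-level BPE vocabulary marks a preceding space with this symbol.
SPACE_MARKER = "Ġ"


def spelling_target(token):
    """Characters of a token; dropping the space marker is an assumption."""
    if token.startswith(SPACE_MARKER):
        token = token[len(SPACE_MARKER):]
    return list(token)


def build_probe_split(embeddings, test_fraction=0.2, seed=0):
    """Split a {token: vector} table into train/test lists of (vector, characters).

    The probe is fit on the train tokens and must spell the held-out test
    tokens from their embeddings alone.
    """
    tokens = sorted(embeddings)
    random.Random(seed).shuffle(tokens)
    n_test = max(1, int(len(tokens) * test_fraction))
    test_tokens, train_tokens = tokens[:n_test], tokens[n_test:]

    def to_pair(tok):
        return embeddings[tok], spelling_target(tok)

    return [to_pair(t) for t in train_tokens], [to_pair(t) for t in test_tokens]


# Toy stand-in for a real embedding matrix: random 4-dimensional vectors.
rng = random.Random(0)
toy_vocab = ["Ġhouse", "Ġcat", "ing", "Ġspell", "tion", "Ġbee", "ed", "Ġtoken", "ly", "Ġdog"]
toy_embeddings = {tok: [rng.gauss(0, 1) for _ in range(4)] for tok in toy_vocab}

train_pairs, test_pairs = build_probe_split(toy_embeddings)
print(len(train_pairs), "train tokens,", len(test_pairs), "held-out tokens")
```

Holding out whole tokens matters here: if the probe could memorize a token's spelling during training, success on that token would say nothing about what the embedding itself encodes.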
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Attention Is All You Need · Linear Layer · Attention Dropout · Dense Connections · Dropout · Weight Decay · Residual Connection · Multi-Head Attention · Adam · Softmax
