What do tokens know about their characters and how do they know it?
Ayush Kaushal, Kyle Mahowald

TL;DR
Pre-trained language models encode detailed character-level information within their token embeddings, which can be probed and analyzed across multiple languages and model sizes, revealing mechanisms of knowledge acquisition during training.
Contribution
This study systematically probes how pre-trained models encode character information, demonstrating their ability to predict character presence and analyzing the mechanisms behind this knowledge acquisition.
Findings
Models encode character information robustly across languages.
Larger models perform better at encoding character details.
Character knowledge is acquired through multiple phenomena during training.
Abstract
Pre-trained language models (PLMs) that use subword tokenization schemes can succeed at a variety of language tasks that require character-level information, despite lacking explicit access to the character composition of tokens. Here, studying a range of models (e.g., GPT-J, BERT, RoBERTa, GloVe), we probe what word pieces encode about character-level information by training classifiers to predict the presence or absence of a particular alphabetical character in a token, based on its embedding (e.g., probing whether the model embedding for "cat" encodes that it contains the character "a"). We find that these models robustly encode character-level information and, in general, larger models perform better at the task. We show that these results generalize to characters from non-Latin alphabets (Arabic, Devanagari, and Cyrillic). Then, through a series of experiments and analyses, we…
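The probing setup described in the abstract can be illustrated with a minimal sketch. This is not the authors' code. It assumes a hypothetical `toy_embedding` function (letter counts plus Gaussian noise) as a stand-in for real model embeddings, a small made-up token list, and a plain logistic-regression probe trained by gradient descent. The names `char_labels`, `train_probe`, and `predict` are likewise illustrative.

```python
import math
import random


def char_labels(tokens, char):
    """Binary label per token: 1 if the token contains `char`, else 0."""
    return [1 if char in tok.lower() else 0 for tok in tokens]


def toy_embedding(token, dim=26, noise=0.1):
    """ASSUMPTION: stand-in for a model embedding (letter counts + noise).

    A real probe would instead look up the token's vector in the
    embedding matrix of a pre-trained model.
    """
    vec = [random.gauss(0.0, noise) for _ in range(dim)]
    for ch in token.lower():
        idx = ord(ch) - ord("a")
        if 0 <= idx < dim:
            vec[idx] += 1.0
    return vec


def train_probe(X, y, lr=0.5, epochs=200):
    """Fit a logistic-regression probe with batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted prob - label
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Predict 1 (character present) when the probe's logit is positive."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0


random.seed(0)
# Hypothetical toy vocabulary; a real study would use the model's vocabulary.
tokens = ["cat", "dog", "bird", "tree", "lamp", "star",
          "moon", "fish", "rock", "sand", "wolf", "bear"]
labels = char_labels(tokens, "a")
X = [toy_embedding(tok) for tok in tokens]
w, b = train_probe(X, labels)
accuracy = sum(predict(w, b, x) == y for x, y in zip(X, labels)) / len(tokens)
print(f"training accuracy of the 'a' probe: {accuracy:.2f}")
```

Here the probe is scored on the same tokens it was trained on, which only shows that the signal is linearly recoverable from the toy vectors. A meaningful probe would be evaluated on held-out tokens and would use embeddings taken from an actual model.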
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Weight Decay · Softmax · Layer Normalization · Attention Dropout · WordPiece · Adam
