Signal in Noise: Exploring Meaning Encoded in Random Character Sequences with Character-Aware Language Models
Mark Chu, Bhargav Srinivasa Desikan, Ethan O. Nadler, D. Ruggiero Lo Sardo, Elise Darragh-Ford, and Douglas Guilbeault

TL;DR
This paper investigates how character-aware language models encode meaning and primitive information by analyzing embeddings of random character sequences, revealing intrinsic links between primitive signals and linguistic structure.
Contribution
It introduces a novel approach using random character sequences to study meaning in language models, uncovering how primitive information relates to linguistic features.
Findings
An axis in the model's embedding space separates random character sequences, pseudowords, and real words.
This axis correlates with part-of-speech, morphology, and concreteness.
Primitive character information is linked to linguistic structure.
Abstract
Natural language processing models learn word representations based on the distributional hypothesis, which asserts that word context (e.g., co-occurrence) correlates with meaning. We propose that n-grams composed of random character sequences, or garble, provide a novel context for studying word meaning both within and beyond extant language. In particular, randomly generated character n-grams lack meaning but contain primitive information based on the distribution of characters they contain. By studying the embeddings of a large corpus of garble, extant language, and pseudowords using CharacterBERT, we identify an axis in the model's high-dimensional embedding space that separates these classes of n-grams. Furthermore, we show that this axis relates to structure within extant language, including word part-of-speech, morphology, and concept concreteness. Thus, in contrast to…
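The idea of an axis that separates garble from extant language can be illustrated with a minimal sketch. This is not the authors' pipeline. It assumes uniformly sampled lowercase letters for garble, uses a toy letter-frequency vector as a stand-in for CharacterBERT embeddings, and substitutes a simple difference-of-class-means direction for whatever procedure the paper uses to identify its axis. All function names and the short word list are hypothetical.

```python
import random
import string
from collections import Counter

ALPHABET = string.ascii_lowercase

def make_garble(rng, min_len=3, max_len=10):
    """Draw a random character n-gram with uniformly sampled letters (assumption)."""
    n = rng.randint(min_len, max_len)
    return "".join(rng.choice(ALPHABET) for _ in range(n))

def char_frequency_vector(token):
    """Toy stand-in for an embedding: relative frequency of each letter."""
    counts = Counter(token)
    return [counts[c] / len(token) for c in ALPHABET]

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def separating_axis(class_a, class_b):
    """Unit vector pointing from the mean of class_a to the mean of class_b."""
    mean_a, mean_b = mean_vector(class_a), mean_vector(class_b)
    diff = [b - a for a, b in zip(mean_a, mean_b)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]

def project(vector, axis):
    """Scalar position of a vector along the axis."""
    return sum(v * a for v, a in zip(vector, axis))

rng = random.Random(0)
garble = [make_garble(rng) for _ in range(200)]
# Hypothetical sample of extant words, chosen only for illustration.
words = ["language", "meaning", "context", "model",
         "character", "sequence", "signal", "noise"]

garble_vecs = [char_frequency_vector(t) for t in garble]
word_vecs = [char_frequency_vector(t) for t in words]
axis = separating_axis(garble_vecs, word_vecs)

# Average position of each class along the axis.
garble_score = sum(project(v, axis) for v in garble_vecs) / len(garble_vecs)
word_score = sum(project(v, axis) for v in word_vecs) / len(word_vecs)
print(f"garble: {garble_score:.3f}  words: {word_score:.3f}")
```

By construction, the real words land further along the axis than the garble on average. With actual model embeddings in place of `char_frequency_vector`, the same projection could be compared against word-level properties such as part of speech or concreteness.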
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Language and cultural evolution
Methods: CharacterBERT
