Signal in Noise: Exploring Meaning Encoded in Random Character Sequences with Character-Aware Language Models

Mark Chu, Bhargav Srinivasa Desikan, Ethan O. Nadler, D. Ruggiero Lo Sardo, Elise Darragh-Ford, and Douglas Guilbeault

arXiv:2203.07911·cs.CL·April 21, 2022·1 cites

Open Access · 1 Repo

TL;DR

This paper investigates how character-aware language models encode meaning and primitive information by analyzing embeddings of random character sequences, revealing intrinsic links between primitive signals and linguistic structure.

Contribution

It introduces a novel approach using random character sequences to study meaning in language models, uncovering how primitive information relates to linguistic features.

Findings

01 An axis in embedding space separates random character sequences, pseudowords, and real words.

02 This axis correlates with part-of-speech, morphology, and concreteness.

03 Primitive character information is linked to linguistic structure.
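
The idea of an axis that separates classes of embeddings can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the vectors are synthetic Gaussian stand-ins rather than CharacterBERT embeddings, and the axis is taken as the normalized difference of class means, which is a simple substitute and not necessarily the procedure used in the paper. The helper names (`mean_vector`, `separating_axis`, `project`) are hypothetical.

```python
import math
import random


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def separating_axis(class_a, class_b):
    """Unit vector pointing from the mean of class_a to the mean of class_b."""
    mean_a, mean_b = mean_vector(class_a), mean_vector(class_b)
    diff = [b - a for a, b in zip(mean_a, mean_b)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def project(vector, axis):
    """Scalar position of a vector along the axis (dot product)."""
    return sum(v * a for v, a in zip(vector, axis))


rng = random.Random(0)
dim = 8

# Synthetic stand-ins (assumption): "garble" vectors centred at 0,
# "word" vectors centred at 2 in every dimension.
garble_vecs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(200)]
word_vecs = [[rng.gauss(2.0, 1.0) for _ in range(dim)] for _ in range(200)]

axis = separating_axis(garble_vecs, word_vecs)
garble_scores = [project(v, axis) for v in garble_vecs]
word_scores = [project(v, axis) for v in word_vecs]

mean_garble = sum(garble_scores) / len(garble_scores)
mean_word = sum(word_scores) / len(word_scores)
print(f"mean projection, garble: {mean_garble:.2f}; words: {mean_word:.2f}")
```

With real embeddings, the same projection could then be compared against word-level properties such as part-of-speech or concreteness ratings, which is the kind of relationship the findings describe.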

Abstract

Natural language processing models learn word representations based on the distributional hypothesis, which asserts that word context (e.g., co-occurrence) correlates with meaning. We propose that $n$-grams composed of random character sequences, or garble, provide a novel context for studying word meaning both within and beyond extant language. In particular, randomly generated character $n$-grams lack meaning but contain primitive information based on the distribution of characters they contain. By studying the embeddings of a large corpus of garble, extant language, and pseudowords using CharacterBERT, we identify an axis in the model's high-dimensional embedding space that separates these classes of $n$-grams. Furthermore, we show that this axis relates to structure within extant language, including word part-of-speech, morphology, and concept concreteness. Thus, in contrast to…
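
As a rough illustration of what garble and its character distribution might look like, the sketch below generates random character sequences and computes each sequence's character frequencies. The alphabet (lowercase ASCII), the uniform sampling, and the length range of 3 to 10 are assumptions chosen for the example; the abstract does not specify how the authors generated their corpus, and the function names are hypothetical.

```python
import random
import string
from collections import Counter


def make_garble(n_chars, rng, alphabet=string.ascii_lowercase):
    """Return one random character sequence of length n_chars.

    Characters are drawn uniformly from the alphabet (an assumption).
    """
    return "".join(rng.choice(alphabet) for _ in range(n_chars))


def char_distribution(token, alphabet=string.ascii_lowercase):
    """Relative frequency of each alphabet character in the token."""
    counts = Counter(token)
    return [counts[c] / len(token) for c in alphabet]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Five garble tokens with lengths between 3 and 10 (assumed range).
garble = [make_garble(rng.randint(3, 10), rng) for _ in range(5)]
dists = [char_distribution(g) for g in garble]

for token, dist in zip(garble, dists):
    most_common = max(zip(dist, string.ascii_lowercase))[1]
    print(token, "most frequent character:", most_common)
```

Each token carries no lexical meaning, but its frequency vector is a simple example of the primitive, character-level information the abstract refers to.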

Code & Models

Repositories

comp-syn/garble
none · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Language and cultural evolution

Methods: CharacterBERT