Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning

Julia Witte Zimmerman; Denis Hudon; Kathryn Cramer; Alejandro J. Ruiz; Calla Beauregard; Ashley Fehr; Mikaela Irene Fudolig; Bradford Demarest; Yoshi Meke Bird; Milo Z. Trujillo; Christopher M. Danforth; Peter Sheridan Dodds

arXiv:2412.10924 · cs.CL · November 25, 2025

Open Access

TL;DR

This paper examines how tokenization influences large language models' understanding and cognition, highlighting the importance of linguistically meaningful units and the impact of tokenization algorithms on model bias and semantic access.

Contribution

It argues for revising tokenization techniques to better reflect linguistic units and explores how tokenization affects LLM cognition, bias, and meaning construction.

Findings

01. Tokenization affects LLMs' access to distributional patterns.

02. Current tokenization methods create suboptimal semantic units.

03. Tokenization algorithms influence model bias and cognition.

Abstract

Tokenization is a necessary component within the current architecture of many language models, including the transformer-based large language models (LLMs) of Generative AI, yet its impact on the model's cognition is often overlooked. We argue that LLMs demonstrate that the Distributional Hypothesis (DH) is sufficient for reasonably human-like language performance (particularly with respect to inferential lexical competence), and that the emergence of human-meaningful linguistic units among tokens and current structural constraints motivate changes to existing, linguistically-agnostic tokenization techniques, particularly with respect to their roles as (1) vehicles for conveying salient distributional patterns from human language to the model and as (2) semantic primitives. We explore tokenizations from a BPE tokenizer; extant model vocabularies obtained from Hugging Face and tiktoken;…
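The abstract refers to a BPE tokenizer without giving its configuration. As a rough illustration of why byte-pair encoding is "linguistically agnostic", the sketch below trains a toy character-level BPE on a tiny corpus: merges are chosen purely by pair frequency, so any morpheme-like units that appear do so as a side effect. The function names (`train_bpe`, `tokenize`) and the sample corpus are hypothetical and are not taken from the paper.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn an ordered list of merges from a whitespace-split corpus (toy setup)."""
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # frequency only; no linguistic knowledge
        merges.append(best)
        words = merge_pair(words, best)
    return merges


def tokenize(word, merges):
    """Apply the learned merges, in order, to a single word."""
    words = {tuple(word): 1}
    for pair in merges:
        words = merge_pair(words, pair)
    return list(next(iter(words)))


merges = train_bpe("low low low lower lowest", num_merges=2)
print(tokenize("lowest", merges))
print(tokenize("slow", merges))
```

In this toy run the stem "low" ends up as a single token only because it is frequent, which is the kind of incidental alignment between tokens and human-meaningful units that the abstract discusses. Production tokenizers (for example those distributed via Hugging Face or tiktoken) differ in pre-tokenization, byte-level handling, and vocabulary size.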

Taxonomy

Topics: Computational and Text Analysis Methods

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Adam · Layer Normalization · Residual Connection · Weight Decay · WordPiece · Softmax