Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning
Julia Witte Zimmerman, Denis Hudon, Kathryn Cramer, Alejandro J. Ruiz, Calla Beauregard, Ashley Fehr, Mikaela Irene Fudolig, Bradford Demarest, Yoshi Meke Bird, Milo Z. Trujillo, Christopher M. Danforth, Peter Sheridan Dodds

TL;DR
This paper examines how tokenization influences large language models' understanding and cognition, highlighting the importance of linguistically meaningful units and the impact of tokenization algorithms on model bias and semantic access.
Contribution
It argues for revising tokenization techniques to better reflect linguistic units and explores how tokenization affects LLM cognition, bias, and meaning construction.
Findings
Tokenization impacts LLMs' access to distributional patterns.
Current tokenization methods create suboptimal semantic units.
Tokenization algorithms influence model bias and cognition.
Abstract
Tokenization is a necessary component within the current architecture of many language models, including the transformer-based large language models (LLMs) of Generative AI, yet its impact on the model's cognition is often overlooked. We argue that LLMs demonstrate that the Distributional Hypothesis (DH) is sufficient for reasonably human-like language performance (particularly with respect to inferential lexical competence), and that the emergence of human-meaningful linguistic units among tokens and current structural constraints motivate changes to existing, linguistically-agnostic tokenization techniques, particularly with respect to their roles as (1) vehicles for conveying salient distributional patterns from human language to the model and as (2) semantic primitives. We explore tokenizations from a BPE tokenizer; extant model vocabularies obtained from Hugging Face and tiktoken;…
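For readers unfamiliar with BPE, the following is a minimal sketch of byte-pair-encoding-style merge learning, written for illustration only. It is not the tokenizer used in the paper: the function names (`train_bpe`, `apply_merge`, `tokenize`) are hypothetical, it operates on characters rather than bytes, and it omits pre-tokenization and special tokens that production tokenizers include.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Fuse every non-overlapping occurrence of `pair` in a symbol sequence."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy version)."""
    # Each word starts as a tuple of single characters, weighted by frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken lexicographically
        # (an arbitrary choice made here for determinism).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[apply_merge(symbols, best)] += freq
        vocab = new_vocab
    return merges


def tokenize(word, merges):
    """Split a word into tokens by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    corpus = ["low", "low", "lower", "lowest", "newer", "newest"]
    learned = train_bpe(corpus, num_merges=5)
    print(learned)
    print(tokenize("lowest", learned))
```

Because merges are chosen purely by pair frequency, nothing in this procedure guarantees that the resulting tokens line up with morphemes or other linguistically meaningful units, which is the kind of linguistic agnosticism the abstract refers to.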
Taxonomy
Topics: Computational and Text Analysis Methods
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Adam · Layer Normalization · Residual Connection · Weight Decay · WordPiece · Softmax
