Leading Whitespaces of Language Models' Subword Vocabulary Pose a Confound for Calculating Word Probabilities

Byung-Doh Oh; William Schuler

arXiv:2406.10851·cs.CL·October 1, 2024

Open Access

TL;DR

This paper identifies a confound in how word probabilities are calculated from language models, caused by leading whitespaces in subword tokens, and proposes a correction method that improves the accuracy of surprisal estimates and aligns them better with psycholinguistic data.

Contribution

The paper demonstrates that leading whitespaces in subword vocabularies distort word probability distributions and introduces a decoding technique to correct this issue.

Findings

01 Corrects overestimation of word probabilities caused by leading whitespaces

02 Reveals lower garden-path effect estimates after correction

03 Provides a method that better aligns model predictions with human reading data

Abstract

Predictions of word-by-word conditional probabilities from Transformer-based language models are often evaluated to model the incremental processing difficulty of human readers. In this paper, we argue that there is a confound posed by the most common method of aggregating subword probabilities of such language models into word probabilities. This is due to the fact that tokens in the subword vocabulary of most language models have leading whitespaces and therefore do not naturally define stop probabilities of words. We first prove that this can result in distributions over word probabilities that sum to more than one, thereby violating the axiom that $P(\Omega) = 1$. This property results in a misallocation of word-by-word surprisal, where the unacceptability of the end of the current word is incorrectly carried over to the next word. Additionally, this implicit prediction of…
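The sum-to-more-than-one problem can be seen in a small toy example. The sketch below is illustrative only: the three-token vocabulary, the probabilities, and the function names (`next_token_probs`, `naive_word_prob`, `word_prob_with_stop`) are all hypothetical and do not come from the paper. It multiplies subword probabilities into word probabilities without any end-of-word term, then shows one possible way to supply a stop probability in this toy setting, which is not necessarily the decoding technique the paper proposes.

```python
# Hypothetical toy next-token model. "_" marks a leading whitespace.
# Token "x" has no leading whitespace, so it can only continue a word.
def next_token_probs(prev_tokens):
    if not prev_tokens:
        # Start of the word being predicted: only word-initial tokens get mass.
        return {"_a": 0.5, "_b": 0.5, "x": 0.0}
    # Inside a word: some mass goes to continuing it with "x".
    return {"_a": 0.4, "_b": 0.4, "x": 0.2}


def naive_word_prob(tokens):
    """Product of subword probabilities, with no explicit end-of-word term."""
    prob = 1.0
    for i, tok in enumerate(tokens):
        prob *= next_token_probs(tokens[:i])[tok]
    return prob


def word_prob_with_stop(tokens):
    """Assumed fix for this toy only: also require that the word ends here,
    i.e. that the following token starts with a whitespace."""
    after = next_token_probs(tokens)
    stop = sum(p for tok, p in after.items() if tok.startswith("_"))
    return naive_word_prob(tokens) * stop


def candidate_words(max_continuations):
    """Enumerate words "a", "ax", "axx", ... and "b", "bx", ... as token tuples."""
    for first in ("_a", "_b"):
        for k in range(max_continuations + 1):
            yield (first,) + ("x",) * k


words = list(candidate_words(max_continuations=5))
naive_total = sum(naive_word_prob(w) for w in words)
stop_total = sum(word_prob_with_stop(w) for w in words)

print(f"naive total over words:     {naive_total:.5f}")
print(f"with stop probability:      {stop_total:.5f}")
```

In this toy, the naive totals exceed one because the word "a" and every longer word that begins with "a" each receive the full probability of the shared prefix. Once a stop term is included, the total stays at or below one, with the small remainder belonging to words longer than the enumerated ones.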

Taxonomy

Topics: Natural Language Processing Techniques