Leading Whitespaces of Language Models' Subword Vocabulary Pose a Confound for Calculating Word Probabilities
Byung-Doh Oh, William Schuler

TL;DR
This paper identifies a confound in how word probabilities are calculated from language models: subword tokens carry leading whitespaces, which distorts the aggregated word probabilities. It proposes a correction method that improves the accuracy of surprisal estimates and brings them into closer alignment with psycholinguistic data.
Contribution
The paper demonstrates that leading whitespaces in subword vocabularies distort word probability distributions and introduces a decoding technique to correct this issue.
Findings
Corrects overestimation of word probabilities caused by leading whitespaces
Reveals lower garden-path effect estimates after correction
Provides a method that better aligns model predictions with human reading data
Abstract
Predictions of word-by-word conditional probabilities from Transformer-based language models are often evaluated to model the incremental processing difficulty of human readers. In this paper, we argue that there is a confound posed by the most common method of aggregating subword probabilities of such language models into word probabilities. This is due to the fact that tokens in the subword vocabulary of most language models have leading whitespaces and therefore do not naturally define stop probabilities of words. We first prove that this can result in distributions over word probabilities that sum to more than one, thereby violating the axiom that P(Ω) = 1. This property results in a misallocation of word-by-word surprisal, where the unacceptability of the end of the current word is incorrectly carried over to the next word. Additionally, this implicit prediction of…
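The confound described in the abstract can be illustrated with a small toy model. The sketch below is not the authors' implementation: the three-token vocabulary, the bigram probabilities, and the function names are all hypothetical, and the correction shown (multiplying by the probability that the next token begins with a whitespace, and dividing out the whitespace probability already implied by the context) is one plausible reading of "reassigning the trailing whitespace to the current word", stated here as an assumption.

```python
# Toy bigram "language model" over three subword tokens (all values made up).
# "_a" and "_b" carry a leading whitespace (word-initial tokens);
# "c" is a word-internal continuation token with no leading whitespace.
NEXT = {
    "<s>": {"_a": 0.6, "_b": 0.4, "c": 0.0},
    "_a": {"_a": 0.3, "_b": 0.2, "c": 0.5},
    "_b": {"_a": 0.5, "_b": 0.4, "c": 0.1},
    "c": {"_a": 0.5, "_b": 0.5, "c": 0.0},
}

# Every word this toy vocabulary can spell: "a", "ac", "b", "bc".
WORDS = [["_a"], ["_a", "c"], ["_b"], ["_b", "c"]]


def p_boundary(prev):
    """Probability that the token after `prev` starts a new word."""
    return sum(p for tok, p in NEXT[prev].items() if tok.startswith("_"))


def chain_prob(prev, word_tokens):
    """Product of subword probabilities for one word, given the previous token."""
    prob = 1.0
    for tok in word_tokens:
        prob *= NEXT[prev][tok]
        prev = tok
    return prob


def conventional(prev, word_tokens):
    """Common aggregation: multiply the word's subword probabilities."""
    return chain_prob(prev, word_tokens)


def corrected(prev, word_tokens):
    """Assumed correction: move the trailing whitespace into the current word."""
    return (
        chain_prob(prev, word_tokens)
        * p_boundary(word_tokens[-1])  # the word actually ends here
        / p_boundary(prev)             # whitespace already counted by the context
    )


if __name__ == "__main__":
    prev = "_b"  # last token of the preceding context
    print("conventional total:", sum(conventional(prev, w) for w in WORDS))
    print("corrected total:   ", sum(corrected(prev, w) for w in WORDS))
```

With the context ending in `_b`, the conventional totals come to roughly 1.19, because the probability of `_a` already contains the mass of the longer word `_a c`. Under the assumed correction the four word probabilities sum to one.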
Taxonomy
Topics: Natural Language Processing Techniques
