How to Compute the Probability of a Word
Tiago Pimentel, Clara Meister

TL;DR
This paper clarifies the correct methods for computing word probabilities from subword language models, revealing widespread errors in prior research and demonstrating the impact of these corrections on linguistic analysis outcomes.
Contribution
It derives the correct way to compute word probabilities from subword models and highlights issues specific to tokenizers that mark the beginning of words (bow-marking), such as those used by the GPT family.
Findings
Incorrect probability computations are common in recent studies.
Correcting these errors significantly alters linguistic analysis results.
The corrections change measured outcomes in sentence comprehension and lexical optimization analyses.
Abstract
Language models (LMs) estimate a probability distribution over strings in a natural language; these distributions are crucial for computing perplexity and surprisal in linguistics research. While we are usually concerned with measuring these values for words, most LMs operate over subwords. Although seemingly straightforward, accurately computing probabilities over one unit given probabilities over the other requires care. Indeed, we show here that many recent linguistic studies have been incorrectly computing these values. This paper derives the correct methods for computing word probabilities, highlighting issues when relying on language models that use beginning-of-word (bow)-marking tokenisers, e.g., the GPT family. Empirically, we show that correcting the widespread bug in probability computations affects measured outcomes in sentence comprehension and lexical optimisation analyses.
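To make the bow-marking issue concrete, here is a minimal Python sketch. It is not taken from this summary: the correction formula is our reading of the idea (multiply the naive product of subword probabilities by the probability mass on bow-marked tokens or end-of-string after the word, and divide by the same mass before the word), and the toy model, the `_` marker, and all function names are hypothetical. With a bow-marking tokeniser, the subwords of a word do not signal that the word has ended, so the naive product also counts longer words that share the same prefix.

```python
from typing import Callable, Dict, Sequence

EOS = "<eos>"
BOW = "_"  # hypothetical beginning-of-word marker

Dist = Dict[str, float]
NextProbs = Callable[[Sequence[str]], Dist]


def toy_next_probs(prefix: Sequence[str]) -> Dist:
    """Hypothetical subword LM that only looks at the previous subword."""
    if prefix and prefix[-1] == "_a":
        # After "_a" the word may continue ("b") or end (bow token or EOS next).
        return {"b": 0.5, "_a": 0.2, "_c": 0.2, EOS: 0.1}
    return {"_a": 0.5, "_c": 0.3, EOS: 0.2}


def bow_mass(next_probs: NextProbs, prefix: Sequence[str]) -> float:
    """Probability that the next token starts a new word or ends the string."""
    dist = next_probs(prefix)
    return sum(p for tok, p in dist.items() if tok.startswith(BOW) or tok == EOS)


def naive_word_prob(next_probs: NextProbs, context: Sequence[str],
                    word: Sequence[str]) -> float:
    """Product of subword probabilities (the computation described as buggy)."""
    prob, prefix = 1.0, list(context)
    for tok in word:
        prob *= next_probs(prefix).get(tok, 0.0)
        prefix.append(tok)
    return prob


def corrected_word_prob(next_probs: NextProbs, context: Sequence[str],
                        word: Sequence[str]) -> float:
    """Naive product, rescaled by bow/EOS mass after and before the word."""
    after = bow_mass(next_probs, list(context) + list(word))
    before = bow_mass(next_probs, context)
    return naive_word_prob(next_probs, context, word) * after / before


# The toy model allows exactly three first words: "a", "ab", and "c".
words = {"a": ["_a"], "ab": ["_a", "b"], "c": ["_c"]}
naive = {w: naive_word_prob(toy_next_probs, [], s) for w, s in words.items()}
fixed = {w: corrected_word_prob(toy_next_probs, [], s) for w, s in words.items()}
p_eos = toy_next_probs([])[EOS]
naive_total = sum(naive.values()) + p_eos
fixed_total = sum(fixed.values()) + p_eos
```

In this toy setup the naive values for the three possible first words plus end-of-string sum to more than one, because the naive probability of "a" also includes the mass of "ab". The corrected values sum to one.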
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Attention Is All You Need · Cosine Annealing · Byte Pair Encoding · Attention Dropout · Dropout · Adam · Linear Warmup With Cosine Annealing · Linear Layer · Dense Connections
