What Goes Into a LM Acceptability Judgment? Rethinking the Impact of Frequency and Length
Lindia Tjuatja, Graham Neubig, Tal Linzen, Sophie Hao

TL;DR
This paper introduces MORCELA, a data-driven method for adjusting language model scores to better match human acceptability judgments by accounting for length and unigram frequency effects, outperforming previous approaches.
Contribution
The paper proposes MORCELA, a novel linking theory with learned parameters for length and frequency adjustments, improving alignment between LM scores and human judgments.
Findings
MORCELA outperforms SLOR across transformer LMs.
Larger models require less adjustment for unigram frequency.
Larger LMs better predict rare words, reducing frequency effects.
Abstract
When comparing the linguistic capabilities of language models (LMs) with those of humans using LM probabilities, factors such as the length of the sequence and the unigram frequency of lexical items have a significant effect on LM probabilities in ways that humans are largely robust to. Prior work comparing LM and human acceptability judgments treats these effects uniformly across models, making the strong assumption that all models require the same degree of adjustment to control for length and unigram frequency effects. We propose MORCELA, a new linking theory between LM scores and acceptability judgments where the optimal level of adjustment for these effects is estimated from data via learned parameters for length and unigram frequency. We first show that MORCELA outperforms a commonly used linking theory for acceptability, SLOR (Pauls and Klein, 2012; Lau et al., 2017), across two families of…
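The contrast between a fixed adjustment and a learned one can be sketched in a few lines. SLOR, as commonly defined, subtracts the unigram log-probability from the LM log-probability and divides by sentence length. The second function below is only an illustrative assumption: the text above says MORCELA learns parameters for length and unigram frequency but does not give its formula, so the names `beta` and `gamma`, and where they enter the score, are hypothetical choices made for this sketch.

```python
def slor(logp_lm: float, logp_unigram: float, length: int) -> float:
    """SLOR: length-normalized difference between the LM log-probability
    and the unigram log-probability of a sentence."""
    return (logp_lm - logp_unigram) / length


def adjusted_score(
    logp_lm: float,
    logp_unigram: float,
    length: int,
    beta: float = 1.0,   # hypothetical weight on the unigram frequency term
    gamma: float = 0.0,  # hypothetical offset controlling the length adjustment
) -> float:
    """Parameterized variant (an assumption, not the paper's stated formula).

    With beta=1 and gamma=0 this reduces to SLOR; fitting beta and gamma to
    human judgments lets the degree of adjustment differ from model to model.
    """
    return (logp_lm - beta * logp_unigram + gamma) / length


# Toy numbers chosen only for illustration.
print(slor(-20.0, -30.0, 5))                                  # SLOR baseline
print(adjusted_score(-20.0, -30.0, 5, beta=0.5, gamma=5.0))   # weaker frequency correction
```

Under this sketch, a model that already predicts rare words well would be fit with a smaller `beta`, which matches the finding above that larger models require less adjustment for unigram frequency.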
Taxonomy
Topics: Experimental Learning in Engineering
