CONTESTS: a Framework for Consistency Testing of Span Probabilities in Language Models
Eitan Wagner, Yuli Slavutsky, Omri Abend

TL;DR
This paper introduces ConTestS, a framework for testing the consistency of span probability scores in language models, revealing discrepancies across models and suggesting entropy-based insights for decoding.
Contribution
The paper presents a statistical testing framework for evaluating the consistency of language model scores, and uses it to compare masked language models (MLMs) with autoregressive models.
Findings
Autoregressive models show larger inconsistencies than MLMs.
Larger MLMs tend to be more consistent in predictions.
Prediction entropies can guide decoding strategies.
Abstract
Although language model scores are often treated as probabilities, their reliability as probability estimators has mainly been studied through calibration, overlooking other aspects. In particular, it is unclear whether language models produce the same value for different ways of assigning joint probabilities to word spans. Our work introduces a novel framework, ConTestS (Consistency Testing over Spans), involving statistical tests to assess score consistency across interchangeable completion and conditioning orders. We conduct experiments on post-release real and synthetic data to eliminate training effects. Our findings reveal that both Masked Language Models (MLMs) and autoregressive models exhibit inconsistent predictions, with autoregressive models showing larger discrepancies. Larger MLMs tend to produce more consistent predictions, while autoregressive models show the opposite…
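To make the notion of consistency concrete: by the chain rule, the joint probability of a two-word span can be obtained by filling the left word first or the right word first, and a true probability model gives the same value either way. The sketch below is a toy illustration of that idea only, not the authors' implementation. The joint table, the score table, and all function names are hypothetical, and the statistical tests that ConTestS applies are not reproduced here.

```python
import math

# Hypothetical toy joint distribution over a two-word span (positions 0 and 1).
JOINT = {
    ("new", "york"): 0.4,
    ("new", "jersey"): 0.2,
    ("los", "angeles"): 0.3,
    ("los", "york"): 0.1,
}

def consistent_cond(pos, token, filled):
    """Conditional derived from JOINT, so every fill order must agree."""
    def compatible(pair, constraints):
        return all(pair[p] == t for p, t in constraints.items())
    denom = sum(p for pair, p in JOINT.items() if compatible(pair, filled))
    numer = sum(p for pair, p in JOINT.items()
                if compatible(pair, {**filled, pos: token}))
    return numer / denom

# Hypothetical model scores that do NOT come from a single joint distribution.
# Keys are (position, token, already-filled positions).
SCORES = {
    (0, "new", ()): 0.6,
    (1, "york", ((0, "new"),)): 0.7,
    (1, "york", ()): 0.5,
    (0, "new", ((1, "york"),)): 0.9,
}

def inconsistent_cond(pos, token, filled):
    """Look up a score from the hand-written (inconsistent) table."""
    return SCORES[(pos, token, tuple(sorted(filled.items())))]

def joint_logprob(cond, order):
    """Sum conditional log-scores while filling positions in the given order."""
    filled = {}
    total = 0.0
    for pos, token in order:
        total += math.log(cond(pos, token, filled))
        filled[pos] = token
    return total

def discrepancy(cond, span):
    """Log-joint filled left-to-right minus log-joint filled right-to-left."""
    left_to_right = [(0, span[0]), (1, span[1])]
    right_to_left = [(1, span[1]), (0, span[0])]
    return joint_logprob(cond, left_to_right) - joint_logprob(cond, right_to_left)

d_consistent = discrepancy(consistent_cond, ("new", "york"))
d_inconsistent = discrepancy(inconsistent_cond, ("new", "york"))
print(d_consistent, d_inconsistent)
```

With the table-derived conditionals the discrepancy is zero up to floating-point error, while the hand-written scores give a nonzero value. In practice `cond` would wrap a real model's masked or next-token scores, and many such discrepancies would be collected before testing whether they differ from zero.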
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
