CONTESTS: a Framework for Consistency Testing of Span Probabilities in Language Models

Eitan Wagner; Yuli Slavutsky; Omri Abend

arXiv:2409.19984·cs.CL·October 1, 2024

TL;DR

This paper introduces ConTestS, a framework for testing the consistency of span probability scores in language models, revealing discrepancies across models and suggesting entropy-based insights for decoding.

Contribution

The paper presents a novel statistical testing framework for evaluating score consistency in language models, and uses it to highlight differences between Masked Language Models (MLMs) and autoregressive models.

Findings

01. Autoregressive models show larger inconsistencies than MLMs.

02. Larger MLMs tend to be more consistent in predictions.

03. Prediction entropies can guide decoding strategies.

Abstract

Although language model scores are often treated as probabilities, their reliability as probability estimators has mainly been studied through calibration, overlooking other aspects. In particular, it is unclear whether language models produce the same value for different ways of assigning joint probabilities to word spans. Our work introduces a novel framework, ConTestS (Consistency Testing over Spans), involving statistical tests to assess score consistency across interchangeable completion and conditioning orders. We conduct experiments on post-release real and synthetic data to eliminate training effects. Our findings reveal that both Masked Language Models (MLMs) and autoregressive models exhibit inconsistent predictions, with autoregressive models showing larger discrepancies. Larger MLMs tend to produce more consistent predictions, while autoregressive models show the opposite…
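
The consistency property described in the abstract can be illustrated for a two-word span. If scores were true probabilities, the chain rule would give the same joint value whichever word is filled in first: P(a) · P(b | a) = P(b) · P(a | b). The sketch below is only an illustration of that identity, not the paper's statistical test. The function names, the toy joint table, and the "inconsistent" scores are all hypothetical and do not come from any model or from the paper.

```python
import math

def log_joint_both_orders(p_a, p_b_given_a, p_b, p_a_given_b):
    """Log joint score of a two-word span under both completion orders."""
    first_then_second = math.log(p_a) + math.log(p_b_given_a)
    second_then_first = math.log(p_b) + math.log(p_a_given_b)
    return first_then_second, second_then_first

def log_discrepancy(p_a, p_b_given_a, p_b, p_a_given_b):
    """Difference between the two log joint scores; 0 means consistent."""
    lr, rl = log_joint_both_orders(p_a, p_b_given_a, p_b, p_a_given_b)
    return lr - rl

def conditionals_from_joint(joint, a, b):
    """Derive marginals and conditionals for (a, b) from a true joint table."""
    p_ab = joint[(a, b)]
    p_a = sum(p for (x, _), p in joint.items() if x == a)
    p_b = sum(p for (_, y), p in joint.items() if y == b)
    return p_a, p_ab / p_a, p_b, p_ab / p_b

# Hypothetical joint distribution over two masked positions.
toy_joint = {
    ("new", "york"): 0.3,
    ("new", "jersey"): 0.2,
    ("los", "york"): 0.1,
    ("los", "jersey"): 0.4,
}

# Scores derived from a real joint distribution agree across orders
# (the discrepancy is zero up to floating-point error).
consistent = log_discrepancy(*conditionals_from_joint(toy_joint, "new", "york"))

# Hypothetical scores that do not come from any single joint distribution:
# 0.5 * 0.6 differs from 0.4 * 0.6, so the two orders disagree.
inconsistent = log_discrepancy(0.5, 0.6, 0.4, 0.6)

print(f"consistent discrepancy:   {consistent:.6f}")
print(f"inconsistent discrepancy: {inconsistent:.6f}")
```

In a real experiment the four inputs would be scores read off a language model for the same span under different masking or conditioning orders, and the discrepancies collected over many spans would be the input to a statistical test.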

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling