Large-scale cloze evaluation reveals that token prediction tasks are neither lexically nor semantically aligned
Cassandra L. Jacobs, Loïc Grobol, Alvin Tsang

TL;DR
This study compares the next-token predictions of several language models to human responses in the cloze task. The models do not accurately reflect human lexical or semantic preferences, even though larger models trained for longer estimate human responses better.
Contribution
The paper provides a detailed analysis showing that current language models are misaligned with human responses in both the lexical and semantic aspects of the cloze task.
Findings
Models, including large ones trained for longer, reliably underestimate the probabilities of human responses.
Models over-rank rare responses and under-rank common ones.
Language models produce semantic spaces that are highly distinct from those of humans.
Abstract
In this work we compare the generative behavior at the next token prediction level in several language models by comparing them to human productions in the cloze task. We find that while large models trained for longer are typically better estimators of human productions, they reliably under-estimate the probabilities of human responses, over-rank rare responses, under-rank top responses, and produce highly distinct semantic spaces. Altogether, this work demonstrates in a tractable, interpretable domain that LM generations cannot be used as replacements of or models of the cloze task.
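To make the comparison in the abstract concrete, the sketch below shows one way human cloze probabilities and model probabilities for a single context could be set side by side. It is not the authors' pipeline: the helper names `cloze_probabilities` and `ranks`, the response list, and the model probabilities are all hypothetical, and the numbers are toy values chosen only to illustrate underestimation and rank mismatch.

```python
from collections import Counter


def cloze_probabilities(responses):
    """Relative frequency of each human response for one context."""
    counts = Counter(responses)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def ranks(probs):
    """Map each word to its 1-based rank by descending probability.

    Ties are broken alphabetically so the result is deterministic.
    """
    ordered = sorted(probs, key=lambda w: (-probs[w], w))
    return {word: i + 1 for i, word in enumerate(ordered)}


# Toy, made-up data for one hypothetical context (not from the paper).
human_responses = ["dog"] * 6 + ["cat"] * 3 + ["ferret"]

# Hypothetical model probabilities for the same words. They need not sum
# to 1, because the rest of the mass goes to other vocabulary items.
model_probs = {"dog": 0.20, "cat": 0.05, "ferret": 0.10}

human_probs = cloze_probabilities(human_responses)
human_ranks = ranks(human_probs)
model_ranks = ranks(model_probs)

for word in sorted(human_probs, key=human_ranks.get):
    gap = model_probs.get(word, 0.0) - human_probs[word]
    print(
        f"{word}: human p={human_probs[word]:.2f} (rank {human_ranks[word]}), "
        f"model p={model_probs.get(word, 0.0):.2f} (rank {model_ranks[word]}), "
        f"gap={gap:+.2f}"
    )
```

In this toy case the model assigns the most common human response a much lower probability than humans do, ranks the rarest human response above a more common one, and ranks that more common response last. These are the kinds of mismatch the findings describe.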
Taxonomy
Topics: Language and cultural evolution · Natural Language Processing Techniques · Topic Modeling
