Large-scale cloze evaluation reveals that token prediction tasks are neither lexically nor semantically aligned

Cassandra L. Jacobs; Loïc Grobol; Alvin Tsang

arXiv:2410.12057 · cs.CL · October 29, 2024

TL;DR

This study compares large language models to human responses in the cloze task, revealing that models do not accurately reflect human lexical or semantic preferences, even though their estimates improve with model size and training duration.

Contribution

The paper provides a detailed analysis showing that current language models are misaligned with human responses in both the lexical and the semantic aspects of the cloze task.

Findings

01. Large models underestimate human response probabilities.

02. Models over-rank rare responses and under-rank common ones.

03. Language models produce semantic spaces that are distinct from those of humans.

Abstract

In this work, we compare the generative behavior of several language models at the level of next-token prediction with human productions in the cloze task. We find that, while large models trained for longer are typically better estimators of human productions, they reliably underestimate the probabilities of human responses, over-rank rare responses, under-rank top responses, and produce highly distinct semantic spaces. Altogether, this work demonstrates in a tractable, interpretable domain that LM generations cannot be used as replacements for, or models of, the cloze task.
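
To make the probability and rank comparisons in the abstract concrete, here is a minimal Python sketch under stated assumptions. The response counts and model probabilities are invented toy numbers, not data from the paper, and the helper names (cloze_probabilities, ranks, compare) are hypothetical; this is not the authors' code from calicolab/clamp, and the per-response log ratio and rank shift shown here are generic illustrative measures that may differ from the metrics the paper uses. Human cloze probability is taken as the share of participants who produced a given completion.

```python
import math
from collections import Counter

# Toy, hypothetical data for one cloze context such as
# "She poured milk into her ___". Not taken from the paper.
human_responses = ["coffee"] * 12 + ["tea"] * 5 + ["cereal"] * 2 + ["glass"]

# Toy, hypothetical next-word probabilities from a language model for the
# same context (a real setup would obtain these from an LM).
model_probs = {"coffee": 0.30, "cup": 0.20, "glass": 0.15, "cereal": 0.10, "tea": 0.08}


def cloze_probabilities(responses):
    """Human cloze probability: fraction of participants giving each word."""
    counts = Counter(responses)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def ranks(probs):
    """Map each word to its rank (1 = most probable); ties broken alphabetically."""
    ordered = sorted(probs, key=lambda w: (-probs[w], w))
    return {word: i + 1 for i, word in enumerate(ordered)}


def compare(human_probs, model_probs, floor=1e-9):
    """For each human response, return (word, p_human, p_model,
    log(p_model / p_human), human rank, model rank or None).

    A negative log ratio means the model assigns less probability than
    humans do; `floor` stands in for words the model never proposes.
    """
    human_rank = ranks(human_probs)
    model_rank = ranks(model_probs)
    rows = []
    for word in sorted(human_rank, key=human_rank.get):
        p_h = human_probs[word]
        p_m = model_probs.get(word, floor)
        rows.append((word, p_h, p_m, math.log(p_m / p_h),
                     human_rank[word], model_rank.get(word)))
    return rows


human_probs = cloze_probabilities(human_responses)
for word, p_h, p_m, log_ratio, h_rank, m_rank in compare(human_probs, model_probs):
    print(f"{word:<8} human={p_h:.2f} model={p_m:.2f} "
          f"log_ratio={log_ratio:+.2f} rank {h_rank} -> {m_rank}")
```

In this toy context, the negative log ratios for "coffee" and "tea" show the model giving less probability than humans to their most common completions, "tea" falls from human rank 2 to model rank 5, and the rare human response "glass" rises from rank 4 to rank 3, which mirrors the kind of pattern the abstract describes.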

Code & Models

Repositories

calicolab/clamp (PyTorch, official)

Taxonomy

Topics: Language and cultural evolution · Natural Language Processing Techniques · Topic Modeling