Why Does Surprisal From Larger Transformer-Based Language Models Provide a Poorer Fit to Human Reading Times?

Byung-Doh Oh; William Schuler

arXiv:2212.12131·cs.CL·December 26, 2022·6 cites

Open Access

TL;DR

This study investigates why larger Transformer language models, despite their lower perplexity, yield surprisal estimates that fit human reading times worse, and traces the loss of predictive power to systematic deviations in those estimates.

Contribution

The paper provides a detailed linguistic analysis suggesting that larger models' propensity to memorize causes their surprisal estimates to diverge from human reading behavior.

Findings

01

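Across five GPT-Neo and eight OPT variants on two datasets, perplexity and fit to reading times show a strictly monotonic, positive log-linear relationship: the larger, lower-perplexity models fit reading times worse.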

02

Residual error analysis shows larger models underpredict reading times for named entities.

03

Larger models make compensatory overpredictions of reading times for function words such as modals and conjunctions, indicating divergence from human expectations.

Abstract

This work presents a detailed linguistic analysis into why larger Transformer-based pre-trained language models with more parameters and lower perplexity nonetheless yield surprisal estimates that are less predictive of human reading times. First, regression analyses show a strictly monotonic, positive log-linear relationship between perplexity and fit to reading times for the more recently released five GPT-Neo variants and eight OPT variants on two separate datasets, replicating earlier results limited to just GPT-2 (Oh et al., 2022). Subsequently, analysis of residual errors reveals a systematic deviation of the larger variants, such as underpredicting reading times of named entities and making compensatory overpredictions for reading times of function words such as modals and conjunctions. These results suggest that the propensity of larger Transformer-based models to 'memorize'…
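The two quantities the abstract relates, surprisal and perplexity, can be illustrated with a minimal sketch. This is not the authors' code. It assumes per-token conditional probabilities are already available from some language model, and the function names and the example probabilities are hypothetical. Summing subword surprisals to get a word-level value is a common convention and is assumed here, not taken from the text above.

```python
import math


def surprisal_bits(prob):
    """Surprisal of one token, in bits: the negative log2 of its conditional probability."""
    return -math.log2(prob)


def word_surprisal(subword_probs):
    """Word-level surprisal as the sum of its subword surprisals (assumed convention)."""
    return sum(surprisal_bits(p) for p in subword_probs)


def perplexity(token_probs):
    """Perplexity: the exponentiated mean negative log probability over tokens."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# Hypothetical conditional probabilities for a three-token sequence.
example_probs = [0.5, 0.25, 0.125]
per_token = [surprisal_bits(p) for p in example_probs]
sequence_ppl = perplexity(example_probs)
```

A model that assigns higher probabilities to the observed tokens has lower perplexity and lower surprisal values. The abstract's claim is that, for the larger variants, these lower surprisal values are a worse predictor of reading times in a regression.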

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Artificial Intelligence in Healthcare and Education

Methods: Multi-Head Attention · Attention Is All You Need · OPT · Linear Layer · Byte Pair Encoding · Attention Dropout · Residual Connection · Discriminative Fine-Tuning · Cosine Annealing · Dropout