How Large Language Models Get Stuck: Early structure with persistent errors
Alokesh Manna, William Snyder, Whitney Tabor

TL;DR
This paper investigates why large language models like OPT struggle with certain linguistic tasks, revealing early entrenched errors that persist and proposing the Bigram Hypothesis to explain this phenomenon.
Contribution
It introduces the Bigram Hypothesis, linking early statistical biases to persistent errors in LLMs, supported by qualitative and quantitative analyses.
Findings
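In nearly one-third of the 67 BLiMP classes, OPT fails to consistently assign a higher likelihood to the grammatical sentence, even after extensive training.
When OPT fails, it often establishes an erroneous separation of the likelihoods early in training and sustains it to the end of the training phase.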
The Bigram Hypothesis offers a new perspective on entrenched errors in LLMs.
Abstract
Linguistic insights may help make Large Language Model (LLM) training more efficient. We trained Meta's OPT model on the 100M word BabyLM dataset, and evaluated it on the BLiMP benchmark, which consists of 67 classes, each defined by sentence pairs that differ in a targeted syntactic or semantic rule violation. We tested the model's preference for grammatical over ungrammatical sentences across training iterations and grammatical types. In nearly one-third of the BLiMP classes, OPT fails to consistently assign a higher likelihood to grammatical sentences, even after extensive training. When it fails, it often establishes a clear (erroneous) separation of the likelihoods at an early stage of processing and sustains this to the end of our training phase. We hypothesize that this mis-categorization is costly because it creates entrenched biases that must, eventually, be reversed in order…
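The evaluation described in the abstract, checking whether the model assigns a higher likelihood to the grammatical member of each sentence pair, can be illustrated with a minimal sketch. This is not the authors' code: the function name `preference_accuracy`, the class labels, and the log-likelihood values are all hypothetical placeholders, and in practice the scores would come from summing token log-probabilities under a trained checkpoint.

```python
from collections import defaultdict


def preference_accuracy(pairs):
    """Fraction of pairs per class where the grammatical sentence scores higher.

    pairs: iterable of (class_name, logp_grammatical, logp_ungrammatical).
    This is an illustrative helper, not part of any BLiMP or OPT tooling.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for cls, logp_good, logp_bad in pairs:
        total[cls] += 1
        if logp_good > logp_bad:
            correct[cls] += 1
    return {cls: correct[cls] / total[cls] for cls in total}


# Hypothetical sentence log-likelihoods, invented for illustration only.
scores = [
    ("class_a", -12.1, -14.8),  # grammatical sentence preferred
    ("class_a", -9.7, -11.2),   # grammatical sentence preferred
    ("class_b", -15.3, -13.9),  # ungrammatical sentence preferred (an error)
    ("class_b", -10.4, -10.9),  # grammatical sentence preferred
]

acc = preference_accuracy(scores)
for cls, value in sorted(acc.items()):
    print(f"{cls}: {value:.2f}")
```

Running the same computation on scores from successive training checkpoints would give one accuracy curve per class, which is the kind of view in which an early, persistent preference for the ungrammatical sentence would show up as a curve that stays below 0.5.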
Taxonomy
Topics: Topic Modeling · Text Readability and Simplification · Natural Language Processing Techniques
