Capability Ceilings in Autoregressive Language Models: Empirical Evidence from Knowledge-Intensive Tasks
Javier Marín

TL;DR
This paper shows empirically that scaling autoregressive language models yields little accuracy gain on knowledge-intensive tasks, even as parameter counts grow and loss keeps falling, which points to either fundamental or implementation-level constraints.
Contribution
The study provides the first systematic empirical evidence of capability ceilings in decoder-only autoregressive models across various knowledge-intensive tasks, revealing scaling limitations.
Findings
Knowledge retrieval accuracy remains flat despite scaling.
Procedural tasks show conventional scaling with accuracy improvements.
Attention perturbation causes catastrophic performance collapse.
Abstract
We document empirical capability ceilings in decoder-only autoregressive language models across knowledge-intensive tasks. Systematic evaluation of OPT and Pythia model families (70M-30B parameters, a 240-fold range in scale) reveals that knowledge retrieval tasks show negligible accuracy improvement despite smooth loss reduction. On MMLU mathematics benchmarks, accuracy remains flat at 19-20% (below 25% random chance) across all scales while cross-entropy loss decreases by 31%. In contrast, procedural tasks like arithmetic show conventional scaling where both metrics improve together. Attention intervention experiments reveal high sensitivity to perturbation: swapping attention patterns between models causes catastrophic performance collapse (complete accuracy loss) rather than graceful degradation. These measurements have immediate engineering implications: for knowledge-intensive…
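The abstract's central observation, that cross-entropy loss can fall while multiple-choice accuracy stays flat, is possible because the two metrics measure different things: accuracy depends only on which option gets the highest score, while loss depends on how much probability the correct option receives. The sketch below illustrates this on toy data. It is not the paper's evaluation code; the logits, the `evaluate` helper, and the "small"/"large" labels are all hypothetical and chosen only to make the effect visible.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def evaluate(all_logits, answers):
    """Return (accuracy, mean cross-entropy in nats) for multiple-choice items."""
    correct = 0
    total_loss = 0.0
    for logits, answer in zip(all_logits, answers):
        probs = softmax(logits)
        # Accuracy only looks at the argmax option.
        predicted = max(range(len(probs)), key=probs.__getitem__)
        correct += int(predicted == answer)
        # Cross-entropy looks at the probability given to the correct option.
        total_loss += -math.log(probs[answer])
    n = len(answers)
    return correct / n, total_loss / n


# Hypothetical four-option items; each entry is the index of the correct option.
answers = [0, 1, 2, 3, 0]

# Hypothetical "small" model: one item right, four items confidently wrong.
small_logits = [
    [2.0, 0.0, 0.0, 0.0],
    [3.0, 0.0, 0.0, 0.0],
    [3.0, 0.0, 0.0, 0.0],
    [3.0, 0.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 0.0],
]

# Hypothetical "large" model: the same items are still wrong, but the correct
# option now comes a close second, so it receives far more probability.
large_logits = [
    [2.0, 0.0, 0.0, 0.0],
    [1.0, 0.8, 0.0, 0.0],
    [1.0, 0.0, 0.8, 0.0],
    [1.0, 0.0, 0.0, 0.8],
    [0.8, 1.0, 0.0, 0.0],
]

small_acc, small_loss = evaluate(small_logits, answers)
large_acc, large_loss = evaluate(large_logits, answers)

print(f"small: accuracy={small_acc:.2f}, loss={small_loss:.3f}")
print(f"large: accuracy={large_acc:.2f}, loss={large_loss:.3f}")
```

Both toy models answer one of five items correctly, so accuracy is identical (and below the 25% chance level for four options), yet the second model's mean cross-entropy is much lower. This is one mechanism that could produce the pattern reported in the abstract; whether it is the mechanism at work in OPT and Pythia is a question for the paper itself.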
