Large Language Models are not Models of Natural Language: they are Corpus Models
Csaba Veres

TL;DR
Large language models are better described as corpus models: they capture statistical patterns in their training data rather than embodying understanding or symbolic cognition, a point made especially clear by how they handle programming languages.
Contribution
The paper challenges the notion that neural language models are models of cognition, proposing they are better understood as corpus models that reflect data distributions.
Findings
Neural models perform well on symbolic tasks like code generation.
Performance on symbolic systems does not imply understanding or symbolic cognition.
The term 'language model' is misleading; 'corpus model' is more accurate.
Abstract
Natural Language Processing (NLP) has become one of the leading application areas in the current Artificial Intelligence boom. Transfer learning has enabled large deep learning neural networks trained on the language modeling task to vastly improve performance in almost all downstream language tasks. Interestingly, when the language models are trained with data that includes software code, they demonstrate remarkable abilities in generating functioning computer code from natural language specifications. We argue that this creates a conundrum for the claim that eliminative neural models are a radical restructuring in our understanding of cognition in that they eliminate the need for symbolic abstractions like generative phrase structure grammars. Because the syntax of programming languages is by design determined by phrase structure grammars, neural models that produce syntactic code are…
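The premise that programming-language syntax is fixed by a phrase structure grammar can be illustrated with a minimal sketch. It assumes Python as the target language and uses the standard-library `ast` parser as the grammar. The helper name `is_syntactic` and the two sample programs are hypothetical illustrations, not taken from the paper.

```python
import ast


def is_syntactic(source: str) -> bool:
    """Return True if `source` is accepted by Python's grammar.

    Hypothetical helper for illustration only: it checks syntactic
    well-formedness, not whether the program is correct or meaningful.
    """
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


# Two hypothetical candidate outputs, differing by one closing parenthesis.
samples = {
    "well_formed": "def add(a, b):\n    return a + b\n",
    "ill_formed": "def add(a, b:\n    return a + b\n",
}

results = {name: is_syntactic(src) for name, src in samples.items()}
print(results)
```

Whether a generated string counts as syntactic code is decided entirely by the grammar, so a check like this gives a yes-or-no answer regardless of how the string was produced.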
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
