Towards a theory of how the structure of language is acquired by deep neural networks
Francesco Cagnetta, Matthieu Wyart

TL;DR
This paper investigates how deep neural networks learn language structure by analyzing synthetic datasets generated from a probabilistic context-free grammar, revealing how training data size influences the depth of learned hierarchical representations.
Contribution
It provides an analytical framework linking training set size to the depth of hierarchical structure learned by language models, supported by empirical validation on real texts.
Findings
The effective range of resolvable token-token correlations grows with training set size
Models that represent deeper levels of the grammar achieve better next-token prediction performance
Scaling laws depend on context window length
Abstract
How much data is required to learn the structure of a language via next-token prediction? We study this question for synthetic datasets generated via a Probabilistic Context-Free Grammar (PCFG) -- a tree-like generative model that captures many of the hierarchical structures found in natural languages. We determine token-token correlations analytically in our model and show that they can be used to build a representation of the grammar's hidden variables, the longer the range the deeper the variable. In addition, a finite training set limits the resolution of correlations to an effective range, whose size grows with that of the training set. As a result, a Language Model trained with increasingly many examples can build a deeper representation of the grammar's structure, thus reaching good performance despite the high dimensionality of the problem. We conjecture that the relationship…
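The idea of token-token correlations in tree-generated text can be illustrated with a minimal sketch. It is not the authors' construction: it assumes a toy grammar with a fixed binary tree, uniformly random production rules, and a simple correlation measure (the summed absolute difference between the joint and product-of-marginals frequencies). All names and parameter values (`make_rules`, `sample_sentence`, `correlation_strength`, `VOCAB`, `NUM_RULES`, `BRANCHING`, `DEPTH`) are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter

def make_rules(vocab_size, num_rules, branching, depth, rng):
    """For each level, map every symbol to `num_rules` random tuples of child symbols."""
    rules = []
    for _ in range(depth):
        level = {
            sym: [tuple(rng.randrange(vocab_size) for _ in range(branching))
                  for _ in range(num_rules)]
            for sym in range(vocab_size)
        }
        rules.append(level)
    return rules

def sample_sentence(rules, vocab_size, rng):
    """Expand a random root symbol level by level down to the leaf tokens."""
    symbols = [rng.randrange(vocab_size)]
    for level in rules:
        expanded = []
        for sym in symbols:
            # Pick one of the symbol's production rules uniformly at random.
            expanded.extend(rng.choice(level[sym]))
        symbols = expanded
    return symbols

def correlation_strength(sentences, i, j, vocab_size):
    """Sum over token pairs (a, b) of |P(a, b) - P(a) P(b)| at positions i and j."""
    n = len(sentences)
    joint = Counter((s[i], s[j]) for s in sentences)
    left = Counter(s[i] for s in sentences)
    right = Counter(s[j] for s in sentences)
    total = 0.0
    for a in range(vocab_size):
        for b in range(vocab_size):
            total += abs(joint[(a, b)] / n - left[a] * right[b] / n**2)
    return total

# Assumed toy settings: 4 symbols, 2 rules per symbol, binary tree of depth 3.
rng = random.Random(0)
VOCAB, NUM_RULES, BRANCHING, DEPTH = 4, 2, 2, 3
rules = make_rules(VOCAB, NUM_RULES, BRANCHING, DEPTH, rng)
sentences = [sample_sentence(rules, VOCAB, rng) for _ in range(20000)]

# Correlation of the last token with tokens 1, 2 and 4 positions earlier.
# These pairs share an ancestor one, two and three levels up the tree.
last = BRANCHING**DEPTH - 1
for distance in (1, 2, 4):
    print(distance, correlation_strength(sentences, last - distance, last, VOCAB))
```

In this sketch, tokens further apart are linked only through an ancestor higher in the tree, so their correlation is mediated by a deeper hidden variable. Shrinking the number of sampled sentences makes the sampling noise in the frequency estimates larger, which is one way to see why a finite training set would limit the range over which correlations can be resolved.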
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
