The Impact of Depth on Compositional Generalization in Transformer Language Models
Jackson Petty, Sjoerd van Steenkiste, Ishita Dasgupta, Fei Sha, Dan Garrette, Tal Linzen

TL;DR
This study investigates how the depth of transformer language models affects their ability to generalize compositionally when the total parameter count is held constant. Deeper models generalize better, but with diminishing returns, so shallower models can still be effective.
Contribution
The paper provides empirical evidence that increased depth improves compositional generalization in transformers, independent of overall model size, with practical recommendations for model design.
Findings
Deeper models generalize more compositionally after fine-tuning.
Additional layers yield diminishing returns in performance.
Depth benefits are not solely due to language modeling performance.
Abstract
To process novel sentences, language models (LMs) must generalize compositionally -- combine familiar elements in new ways. What aspects of a model's structure promote compositional generalization? Focusing on transformers, we test the hypothesis, motivated by theoretical and empirical work, that deeper transformers generalize more compositionally. Simply adding layers increases the total number of parameters; to address this confound between depth and size, we construct three classes of models which trade off depth for width such that the total number of parameters is kept constant (41M, 134M and 374M parameters). We pretrain all models as LMs and fine-tune them on tasks that test for compositional generalization. We report three main conclusions: (1) after fine-tuning, deeper models generalize more compositionally than shallower models do, but the benefit of additional layers…
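The depth-for-width trade-off described in the abstract can be illustrated with a little parameter arithmetic. The sketch below is not the authors' procedure: it assumes that "width" means the feed-forward dimension `d_ff`, that `d_model` stays fixed, and that a standard approximate parameter count applies (four attention projections plus two feed-forward projections per layer, one embedding matrix, with biases and layer norms ignored). The function names and the example values for `d_model` and `vocab_size` are hypothetical.

```python
def layer_params(d_model: int, d_ff: int) -> int:
    """Approximate weights in one transformer layer (biases and norms ignored)."""
    attention = 4 * d_model * d_model   # Q, K, V and output projections
    feed_forward = 2 * d_model * d_ff   # up-projection and down-projection
    return attention + feed_forward


def total_params(n_layers: int, d_model: int, d_ff: int, vocab_size: int) -> int:
    """Approximate total: one embedding matrix plus n_layers identical layers."""
    return vocab_size * d_model + n_layers * layer_params(d_model, d_ff)


def d_ff_for_budget(budget: int, n_layers: int, d_model: int, vocab_size: int) -> int:
    """Largest d_ff that keeps the approximate total at or under `budget`."""
    remaining = budget - vocab_size * d_model - n_layers * 4 * d_model * d_model
    d_ff = remaining // (2 * d_model * n_layers)
    if d_ff < 1:
        raise ValueError("budget too small for this depth and d_model")
    return d_ff


if __name__ == "__main__":
    # Hypothetical settings: a 41M budget, as in the smallest class in the abstract;
    # d_model and vocab_size are illustrative assumptions, not values from the paper.
    budget, d_model, vocab_size = 41_000_000, 512, 32_000
    for n_layers in (1, 2, 4, 8, 16):
        d_ff = d_ff_for_budget(budget, n_layers, d_model, vocab_size)
        print(n_layers, d_ff, total_params(n_layers, d_model, d_ff, vocab_size))
```

Under these assumptions, each doubling of depth shrinks the feed-forward width by more than half, because the attention projections consume a growing share of the fixed budget. Past some depth the budget cannot be met at all, and `d_ff_for_budget` raises an error.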
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
