Learning Syntax Without Planting Trees: Understanding Hierarchical Generalization in Transformers
Kabir Ahuja, Vidhisha Balachandran, Madhur Panwar, Tianxing He, Noah A. Smith, Navin Goyal, Yulia Tsvetkov

TL;DR
This paper investigates why transformer models trained on language data learn hierarchical syntactic structure, examining the influence of training objectives and internal subnetworks, and finds that the language modeling objective in particular leads to hierarchical generalization.
Contribution
The study shows that the language modeling objective promotes hierarchical generalization in transformers and uncovers internal subnetworks responsible for different generalization behaviors.
Findings
Language modeling objectives lead to hierarchical generalization.
Pruning reveals subnetworks with different structural biases.
Transformers prefer hierarchical explanations when they exist.
Abstract
Transformers trained on natural language data have been shown to learn its hierarchical structure and generalize to sentences with unseen syntactic structures without explicitly encoding any structural bias. In this work, we investigate sources of inductive bias in transformer models and their training that could cause such generalization behavior to emerge. We extensively experiment with transformer models trained on multiple synthetic datasets and with different training objectives and show that while other objectives, e.g., sequence-to-sequence modeling and prefix language modeling, often failed to lead to hierarchical generalization, models trained with the language modeling objective consistently learned to generalize hierarchically. We then conduct pruning experiments to study how transformers trained with the language modeling objective encode hierarchical structure. When pruned, we…
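The abstract contrasts the language modeling objective with sequence-to-sequence and prefix language modeling. As a rough illustration only, the sketch below shows one common way these objectives differ for an input/output pair: which positions contribute to the loss, and which positions a decoder-only model may attend to. The function names, the objective labels ("lm", "prefix_lm", "seq2seq"), and the exact masking scheme are assumptions made for this example, not details taken from the paper.

```python
from typing import List


def loss_mask(n_input: int, n_output: int, objective: str) -> List[int]:
    """Return 1 for positions that contribute to the loss, 0 otherwise.

    Assumption for this sketch: a training example is the input tokens
    followed by the output tokens.
    """
    if objective == "lm":
        # Plain language modeling: every token in the sequence is a target.
        return [1] * (n_input + n_output)
    if objective in ("prefix_lm", "seq2seq"):
        # Only the output tokens are predicted; the input is context.
        return [0] * n_input + [1] * n_output
    raise ValueError(f"unknown objective: {objective}")


def attention_mask(n_input: int, n_output: int, objective: str) -> List[List[int]]:
    """Return mask[i][j] == 1 if position i may attend to position j.

    Covers decoder-only variants only ("lm" and "prefix_lm").
    """
    if objective not in ("lm", "prefix_lm"):
        raise ValueError(f"unsupported objective: {objective}")
    n = n_input + n_output
    # Start from a causal (lower-triangular) mask.
    mask = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
    if objective == "prefix_lm":
        # Input (prefix) positions attend to each other bidirectionally.
        for i in range(n_input):
            for j in range(n_input):
                mask[i][j] = 1
    return mask


if __name__ == "__main__":
    # Hypothetical toy sizes: 3 input tokens, 2 output tokens.
    for name in ("lm", "prefix_lm", "seq2seq"):
        print(name, loss_mask(3, 2, name))
```

Under these assumptions, the only difference between "lm" and the other two objectives in the loss is whether the input tokens themselves must be predicted.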
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems · Language and cultural evolution
Methods: Pruning
