Strengthening Structural Inductive Biases by Pre-training to Perform Syntactic Transformations
Matthias Lindemann, Alexander Koller, Ivan Titov

TL;DR
This paper proposes a pre-training method that enhances Transformers' structural inductive biases by training on syntactic transformations, improving few-shot learning and generalization in syntactic and semantic tasks.
Contribution
Introducing intermediate pre-training on syntactic transformations to strengthen Transformers' structural biases for better performance on syntactic and semantic tasks.
Findings
Intermediate pre-training improves few-shot learning of syntactic tasks such as chunking.
It also improves structural generalization in semantic parsing.
Attention heads track syntactic transformations effectively.
Abstract
Models need appropriate inductive biases to effectively learn from small amounts of data and generalize systematically outside of the training distribution. While Transformers are highly versatile and powerful, they can still benefit from enhanced structural inductive biases for seq2seq tasks, especially those involving syntactic transformations, such as converting active to passive voice or semantic parsing. In this paper, we propose to strengthen the structural inductive bias of a Transformer by intermediate pre-training to perform synthetically generated syntactic transformations of dependency trees given a description of the transformation. Our experiments confirm that this helps with few-shot learning of syntactic tasks such as chunking, and also improves structural generalization for semantic parsing. Our analysis shows that the intermediate pre-training leads to attention heads…
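To make the idea of "performing a syntactic transformation of a dependency tree given a description of the transformation" more concrete, here is a minimal illustrative sketch. It is not the authors' implementation. The operation names (`concat`, `reverse`, `bracket`, `drop`), the `Node` class, the `label:op` prefix format, and the example sentence are all assumptions made up for illustration. The sketch only shows how a description that assigns one operation to each dependency label could determine an output string, so that a (description + sentence, output) pair could serve as a synthetic seq2seq training example.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Node:
    """One word in a dependency tree (hypothetical minimal representation)."""
    word: str
    label: str = "root"  # dependency relation to the parent
    children: list[Node] = field(default_factory=list)


# Hypothetical operations. Each one combines the string built so far for the
# head with the string built for one dependent subtree.
OPS = {
    "concat": lambda head, dep: f"{head} {dep}",
    "reverse": lambda head, dep: f"{dep} {head}",
    "bracket": lambda head, dep: f"{head} ( {dep} )",
    "drop": lambda head, dep: head,
}


def transform(node: Node, description: dict[str, str], default: str = "concat") -> str:
    """Recursively apply the operation that the description assigns to each edge label."""
    out = node.word
    for child in node.children:
        op = OPS[description.get(child.label, default)]
        out = op(out, transform(child, description, default))
    return out


def make_example(sentence: str, tree: Node, description: dict[str, str]) -> tuple[str, str]:
    """Build one synthetic (source, target) pair; the prefix format is an assumption."""
    prefix = " ".join(f"{label}:{op}" for label, op in sorted(description.items()))
    return f"{prefix} | {sentence}", transform(tree, description)


# Hand-written toy tree for "the dog chased a cat".
tree = Node("chased", "root", [
    Node("dog", "nsubj", [Node("the", "det")]),
    Node("cat", "obj", [Node("a", "det")]),
])
description = {"nsubj": "reverse", "obj": "bracket", "det": "drop"}

source, target = make_example("the dog chased a cat", tree, description)
print(source)  # det:drop nsubj:reverse obj:bracket | the dog chased a cat
print(target)  # dog chased ( cat )
```

In this toy setup the model would have to read the description in the prefix and apply it to the sentence, which requires it to recover the underlying tree structure implicitly. Sampling different descriptions for the same sentence yields many distinct transformation tasks from a single parsed corpus.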
Taxonomy
Topics: Neurobiology of Language and Bilingualism · Intelligent Tutoring Systems and Adaptive Learning · Neural Networks and Applications
Methods: Attention Is All You Need · Sigmoid Activation · Linear Layer · Tanh Activation · Multi-Head Attention · Long Short-Term Memory · Softmax · Byte Pair Encoding · Layer Normalization · Label Smoothing
