SIP: Injecting a Structural Inductive Bias into a Seq2Seq Model by Simulation
Matthias Lindemann, Alexander Koller, Ivan Titov

TL;DR
This paper pre-trains seq2seq models, specifically Transformers, to simulate finite state transducers (FSTs) given their descriptions, which improves systematic generalization and few-shot learning on FST-like tasks.
Contribution
The authors propose a pre-training approach that injects a structural inductive bias into a Transformer by training it to simulate FSTs on synthetic data, improving generalization on structured tasks.
Findings
Enhanced systematic generalization to FST-like tasks.
Improved few-shot learning performance.
Models internalize FST state dynamics.
Abstract
Strong inductive biases enable learning from little data and help generalization outside of the training distribution. Popular neural architectures such as Transformers lack strong structural inductive biases for seq2seq NLP tasks on their own. Consequently, they struggle with systematic generalization beyond the training distribution, e.g. with extrapolating to longer inputs, even when pre-trained on large amounts of text. We show how a structural inductive bias can be efficiently injected into a seq2seq model by pre-training it to simulate structural transformations on synthetic data. Specifically, we inject an inductive bias towards Finite State Transducers (FSTs) into a Transformer by pre-training it to simulate FSTs given their descriptions. Our experiments show that our method imparts the desired inductive bias, resulting in improved systematic generalization and better few-shot…
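To make the idea of "simulating FSTs given their descriptions" concrete, here is a minimal sketch of how one synthetic pre-training pair could be built: a description of a transducer plus an input string as the source, and the transducer's output as the target. The toy transducer, the flat-text encoding in `describe`, and the helper names `run_fst` and `make_example` are illustrative assumptions, not the authors' actual data format or code.

```python
from typing import Dict, List, Tuple

# (state, input symbol) -> (emitted output string, next state)
Transitions = Dict[Tuple[int, str], Tuple[str, int]]


def run_fst(transitions: Transitions, start: int, inp: str) -> str:
    """Run a deterministic FST on `inp` and return the emitted output."""
    state = start
    out: List[str] = []
    for sym in inp:
        emitted, state = transitions[(state, sym)]
        out.append(emitted)
    return "".join(out)


def describe(transitions: Transitions) -> str:
    """Hypothetical flat-text encoding: 'state symbol output next_state' per transition."""
    return " ; ".join(
        f"{p} {a} {o} {q}" for (p, a), (o, q) in sorted(transitions.items())
    )


def make_example(transitions: Transitions, start: int, inp: str) -> Tuple[str, str]:
    """Build one (source, target) pair: description plus input -> FST output."""
    source = f"{describe(transitions)} | {inp}"
    target = run_fst(transitions, start, inp)
    return source, target


# Toy FST over {a, b}: copy the input, uppercasing any symbol that directly follows an 'a'.
TOY: Transitions = {
    (0, "a"): ("a", 1),
    (0, "b"): ("b", 0),
    (1, "a"): ("A", 1),
    (1, "b"): ("B", 0),
}

source, target = make_example(TOY, start=0, inp="abba")
print(source)
print(target)
```

In a pre-training setup along these lines, many such transducers would be sampled at random, so that the model has to read the description rather than memorize any single transformation.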
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Computational Physics and Python Applications
Methods: Multi-Head Attention · Attention Is All You Need · Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Dense Connections · Linear Layer · Label Smoothing · Absolute Position Encodings · Adam
