Making Transformers Solve Compositional Tasks
Santiago Ontañón, Joshua Ainslie, Vaclav Cvicek, Zachary Fisher

TL;DR
This paper investigates how Transformer design choices affect the model's ability to generalize compositionally in NLP tasks, and identifies configurations that outperform previously reported models on key benchmarks.
Contribution
The study systematically explores the Transformer design space, identifying configurations that improve compositional generalization and achieve state-of-the-art results on the COGS and PCFG benchmarks.
Findings
Identified Transformer configurations with significantly improved compositional generalization.
Achieved state-of-the-art results on COGS and PCFG benchmarks.
Demonstrated that the inductive biases introduced by several design decisions significantly affect compositional generalization.
Abstract
Several studies have reported the inability of Transformer models to generalize compositionally, a key type of generalization in many NLP tasks such as semantic parsing. In this paper we explore the design space of Transformer models, showing that the inductive biases given to the model by several design decisions significantly impact compositional generalization. Through this exploration, we identified Transformer configurations that generalize compositionally significantly better than previously reported in the literature in a diverse set of compositional tasks, and that achieve state-of-the-art results in a semantic parsing compositional generalization benchmark (COGS) and a string edit operation composition benchmark (PCFG).
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Software Engineering Research
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dense Connections · Dropout · Layer Normalization · Byte Pair Encoding · Adam
