When Can Transformers Ground and Compose: Insights from Compositional Generalization Benchmarks
Ankur Sikarwar, Arkil Patel, Navin Goyal

TL;DR
This paper demonstrates that transformers can effectively perform grounded compositional reasoning in navigation tasks, outperforming specialized models, and provides insights into their generalization capabilities and underlying computations.
Contribution
It introduces a simple transformer-based model that surpasses specialized architectures on grounding benchmarks and offers a mathematical analysis of its reasoning process.
Findings
Transformers outperform specialized models on ReaSCAN and a modified version of gSCAN.
The ReaSCAN split that tests depth generalization is unfair; on an amended version of that split, transformers generalize to deeper input structures.
A single self-attention layer with one head can generalize to new object attribute combinations.
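For readers unfamiliar with the component named in the last finding, here is a minimal sketch of a generic single-head scaled dot-product self-attention layer in plain Python. It is not the authors' model or their RefEx setup. The function names, the toy token vectors, and the identity projection matrices are all assumptions made for illustration.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_self_attention(x, w_q, w_k, w_v):
    """Generic single-head self-attention (illustrative, not the paper's code).

    x holds one row per token; w_q, w_k, w_v are projection matrices.
    Returns the attended outputs and the attention weights.
    """
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    v = matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of the key matrix
    scores = matmul(q, k_t)  # one row of scores per query token
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


# Hypothetical toy input: three tokens with two features each.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # assumed projections, chosen for readability
outputs, attn = single_head_self_attention(tokens, identity, identity, identity)
```

Each row of `attn` sums to 1, so every output row is a weighted average of the value vectors. In this toy input the first token attends more strongly to tokens that share its active feature.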
Abstract
Humans can reason compositionally whilst grounding language utterances to the real world. Recent benchmarks like ReaSCAN use navigation tasks grounded in a grid world to assess whether neural models exhibit similar capabilities. In this work, we present a simple transformer-based model that outperforms specialized architectures on ReaSCAN and a modified version of gSCAN. On analyzing the task, we find that identifying the target location in the grid world is the main challenge for the models. Furthermore, we show that a particular split in ReaSCAN, which tests depth generalization, is unfair. On an amended version of this split, we show that transformers can generalize to deeper input structures. Finally, we design a simpler grounded compositional generalization task, RefEx, to investigate how transformers reason compositionally. We show that a single self-attention layer with a single…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Speech and dialogue systems
