A Benchmark for Systematic Generalization in Grounded Language Understanding
Laura Ruis, Jacob Andreas, Marco Baroni, Diane Bouchacourt, Brenden M. Lake

TL;DR
This paper introduces gSCAN, a new benchmark for evaluating how well models can generalize compositionally in grounded language understanding, highlighting current models' limitations in systematic generalization.
Contribution
The paper presents gSCAN, a benchmark grounded in a grid world for evaluating compositional generalization in situated language understanding. It extends an earlier benchmark that focused on syntactic aspects of generalization.
Findings
Models struggle with systematic compositional generalization
Baseline models fail dramatically on novel compositional tasks
gSCAN enables evaluation of linguistically motivated rule learning
Abstract
Humans easily interpret expressions that describe unfamiliar situations composed from familiar parts ("greet the pink brontosaurus by the ferris wheel"). Modern neural networks, by contrast, struggle to interpret novel compositions. In this paper, we introduce a new benchmark, gSCAN, for evaluating compositional generalization in situated language understanding. Going beyond a related benchmark that focused on syntactic aspects of generalization, gSCAN defines a language grounded in the states of a grid world, facilitating novel evaluations of acquiring linguistically motivated rules. For example, agents must understand how adjectives such as 'small' are interpreted relative to the current world state or how adverbs such as 'cautiously' combine with new verbs. We test a strong multi-modal baseline model and a state-of-the-art compositional method finding that, in most cases, they fail…
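To make the abstract's point about 'small' concrete, the following is a toy sketch of how a size adjective might be resolved relative to the current world state. It is not gSCAN's actual code: the `GridObject` class, the `resolve_referent` function, and the integer size scale are all hypothetical names and choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class GridObject:
    """Hypothetical object in a grid world (not gSCAN's own data structure)."""
    shape: str
    color: str
    size: int                  # assumed integer size scale
    position: Tuple[int, int]  # (row, column) on the grid


def resolve_referent(world: List[GridObject], shape: str,
                     size_adjective: Optional[str] = None) -> List[GridObject]:
    """Return the objects that a phrase like 'small circle' could refer to.

    The adjective is interpreted relative to the other objects of the same
    shape in this particular world, not as a fixed absolute size.
    """
    candidates = [obj for obj in world if obj.shape == shape]
    if not candidates or size_adjective is None:
        return candidates
    if size_adjective not in ("small", "big"):
        raise ValueError(f"unsupported adjective: {size_adjective}")
    sizes = [obj.size for obj in candidates]
    target = min(sizes) if size_adjective == "small" else max(sizes)
    return [obj for obj in candidates if obj.size == target]


# The same size-2 circle is 'big' in one world and 'small' in another.
world_a = [GridObject("circle", "red", 1, (0, 0)),
           GridObject("circle", "blue", 2, (3, 1))]
world_b = [GridObject("circle", "red", 2, (0, 0)),
           GridObject("circle", "blue", 4, (2, 2))]

small_in_a = resolve_referent(world_a, "circle", "small")
small_in_b = resolve_referent(world_b, "circle", "small")
big_in_a = resolve_referent(world_a, "circle", "big")
```

In this sketch a circle of size 2 counts as 'big' in `world_a` but 'small' in `world_b`. A model that memorizes a fixed size for each adjective would therefore fail whenever the surrounding objects change.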
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
