Syntax-Guided Transformers: Elevating Compositional Generalization and Grounding in Multimodal Environments
Danial Kamali, Parisa Kordjamshidi

TL;DR
This paper improves compositional generalization in multimodal models by using the syntactic structure of the text input, applied through attention masking, which yields better grounding and state-of-the-art results.
Contribution
The paper integrates syntactic information into Transformers for multimodal grounding and reports improved compositional generalization and parameter efficiency.
Findings
Dependency parsing improves grounding performance.
Syntactic attention masking enhances compositional generalization.
Sharing weights across the Transformer encoder improves results when combined with dependency parsing.
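To make the idea of syntactic attention masking concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function name `dependency_attention_mask`, the example sentence, and the hand-written head indices are all assumptions for illustration. The sketch lets each token attend only to itself, its dependency head, and its direct dependents.

```python
def dependency_attention_mask(heads, self_loops=True):
    """Build a boolean attention mask from dependency heads.

    heads[i] is the index of token i's head, or -1 for the root.
    mask[i][j] is True when token i may attend to token j.
    This masking rule is an illustrative assumption, not the paper's exact scheme.
    """
    n = len(heads)
    mask = [[False] * n for _ in range(n)]
    for i, h in enumerate(heads):
        if self_loops:
            mask[i][i] = True      # every token may attend to itself
        if h >= 0:
            mask[i][h] = True      # dependent attends to its head
            mask[h][i] = True      # head attends to its dependent
    return mask


# Hypothetical, hand-written parse (not produced by a real parser):
# "walk" is the root, "circle" depends on "walk",
# and "to", "the", "red" all depend on "circle".
tokens = ["walk", "to", "the", "red", "circle"]
heads = [-1, 4, 4, 4, 0]
mask = dependency_attention_mask(heads)

for token, row in zip(tokens, mask):
    print(f"{token:>6}", ["x" if allowed else "." for allowed in row])
```

In a Transformer, a mask like this would typically be applied by setting the attention scores of disallowed token pairs to a large negative value before the softmax, so that "red" can attend to "circle" but not to "the".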
Abstract
Compositional generalization, the ability of intelligent models to extrapolate understanding of components to novel compositions, is a fundamental yet challenging facet in AI research, especially within multimodal environments. In this work, we address this challenge by exploiting the syntactic structure of language to boost compositional generalization. This paper elevates the importance of syntactic grounding, particularly through attention masking techniques derived from text input parsing. We introduce and evaluate the merits of using syntactic information in the multimodal grounding problem. Our results on grounded compositional generalization underscore the positive impact of dependency parsing across diverse tasks when utilized with Weight Sharing across the Transformer encoder. The results push the state-of-the-art in multimodal grounding and parameter-efficient modeling and…
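The parameter-efficiency claim tied to weight sharing can be illustrated with a rough count. The sketch below assumes a standard Transformer encoder layer (four attention projections, a two-layer feed-forward block, two layer norms) and hypothetical sizes; none of these numbers come from the paper, and the paper's actual sharing scheme may differ.

```python
def encoder_param_count(d_model, d_ff, num_layers, share_weights):
    """Count encoder parameters for a standard Transformer encoder stack.

    Assumes each layer has Q, K, V, and output projections with biases,
    a two-layer feed-forward block with biases, and two layer norms.
    With share_weights=True, all layers reuse one set of parameters.
    """
    attention = 4 * (d_model * d_model + d_model)
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)  # scale and shift per norm
    per_layer = attention + feed_forward + layer_norms
    return per_layer if share_weights else per_layer * num_layers


# Hypothetical sizes chosen only for illustration.
shared = encoder_param_count(d_model=512, d_ff=2048, num_layers=6, share_weights=True)
unshared = encoder_param_count(d_model=512, d_ff=2048, num_layers=6, share_weights=False)
print(f"shared: {shared:,} parameters, unshared: {unshared:,} parameters")
```

Under these assumptions, sharing one layer's weights across all six layers divides the encoder parameter count by the number of layers while keeping the same depth of computation.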
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
Methods: Attention Is All You Need · Label Smoothing · Linear Layer · Absolute Position Encodings · Residual Connection · Multi-Head Attention · Byte Pair Encoding · Dropout · Softmax · Dense Connections
