Jointly Learning Truth-Conditional Denotations and Groundings using Parallel Attention
Leon Bergen, Dzmitry Bahdanau, Timothy J. O'Donnell

TL;DR
This paper introduces a neurosymbolic model that jointly learns word denotations and their groundings using a parallel attention mechanism. It achieves state-of-the-art visual question answering performance on CLEVR, learning to detect and ground objects in images with question-answering performance as the only training signal.
Contribution
It proposes a novel parallel attention mechanism for jointly learning denotations and groundings within a truth-conditional semantic framework.
Findings
Achieves state-of-the-art VQA performance on CLEVR.
Learns to detect and ground objects with question-answering performance as the only training signal.
Learns flexible non-canonical groundings when only the answers to questions in the training set are adjusted.
Abstract
We present a model that jointly learns the denotations of words together with their groundings using a truth-conditional semantics. Our model builds on the neurosymbolic approach of Mao et al. (2019), learning to ground objects in the CLEVR dataset (Johnson et al., 2017) using a novel parallel attention mechanism. The model achieves state-of-the-art performance on visual question answering, learning to detect and ground objects with question performance as the only training signal. We also show that the model is able to learn flexible non-canonical groundings just by adjusting answers to questions in the training set.
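As a rough illustration of what a truth-conditional semantics means here, the sketch below treats each word's denotation as a function from an object to a degree of truth, and evaluates a yes/no question by composing those functions over a scene. It is not the paper's model: the scene, the attribute scores, and the helper names (`word`, `conj`, `exists`) are all hypothetical, and in a learned system the scores would come from the network rather than being hard-coded.

```python
from typing import Callable, Dict, List

# Hypothetical scene representation: each object maps attribute names
# to a score in [0, 1]. These values are made up for illustration.
Obj = Dict[str, float]
Denotation = Callable[[Obj], float]


def word(attr: str) -> Denotation:
    """Denotation of a one-place predicate: object -> degree of truth."""
    return lambda obj: obj.get(attr, 0.0)


def conj(p: Denotation, q: Denotation) -> Denotation:
    """Soft conjunction of two predicates (product of truth degrees)."""
    return lambda obj: p(obj) * q(obj)


def exists(p: Denotation, scene: List[Obj]) -> float:
    """Soft existential: the best score of any object in the scene."""
    return max((p(o) for o in scene), default=0.0)


scene = [
    {"red": 0.9, "cube": 0.8},
    {"blue": 0.95, "sphere": 0.9},
]

# "Is there a red cube?" evaluated against the scene.
red_cube = conj(word("red"), word("cube"))
truth = exists(red_cube, scene)
answer = "yes" if truth > 0.5 else "no"
print(answer)
```

Because the answer is computed from per-object scores, a training signal on the answer alone can in principle shape both what a word denotes and which objects it is grounded to, which is the setting the abstract describes.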
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
