Learning Action-Effect Dynamics for Hypothetical Vision-Language Reasoning Task
Shailaja Keyur Sampat, Pratyay Banerjee, Yezhou Yang, Chitta Baral

TL;DR
This paper introduces a new learning strategy for modeling action effects in vision-language reasoning tasks, enhancing understanding of hypothetical scenarios involving actions and changes.
Contribution
It proposes an encoder-decoder architecture to learn action representations and integrates it with existing models for improved reasoning on the CLEVR_HYP dataset.
Findings
Improved reasoning accuracy over baselines
Enhanced data efficiency in learning
Better generalization to unseen scenarios
Abstract
'Actions' play a vital role in how humans interact with the world. Thus, autonomous agents that would assist us in everyday tasks also require the capability to perform 'Reasoning about Actions & Change' (RAC). This has been an important research direction in Artificial Intelligence (AI) in general, but the study of RAC with visual and linguistic inputs is relatively recent. The CLEVR_HYP (Sampat et. al., 2021) is one such testbed for hypothetical vision-language reasoning with actions as the key focus. In this work, we propose a novel learning strategy that can improve reasoning about the effects of actions. We implement an encoder-decoder architecture to learn the representation of actions as vectors. We combine the aforementioned encoder-decoder architecture with existing modality parsers and a scene graph question answering model to evaluate our proposed system on the CLEVR_HYP…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
