Enforcing Reasoning in Visual Commonsense Reasoning
Hammad A. Ayyubi, Md. Mehrab Tanjim, David J. Kriegman

TL;DR
This paper introduces an end-to-end trainable model for Visual Commonsense Reasoning that jointly predicts answers and rationales. It addresses the limitations of earlier approaches that trained the two predictions separately, and it explores multiple training strategies.
Contribution
It proposes a novel joint modeling approach with four training methods to improve reasoning in visual question answering tasks.
Findings
The model performs competitively with state-of-the-art methods.
Joint answer and rationale prediction enhances reasoning capabilities.
Multiple training strategies offer flexible solutions for differentiability issues.
Abstract
The task of Visual Commonsense Reasoning is extremely challenging: the model must not only answer a question given an image but also learn to reason. The baselines introduced for this task are quite limiting because two networks are trained to predict answers and rationales separately. The question and image are used as input to train the answer prediction network, while the question, image, and correct answer are used as input to the rationale prediction network. Because the rationale is conditioned on the correct answer, this setup assumes that the Visual Question Answering task can be solved without any error, which is overambitious. Moreover, such an approach turns answer and rationale prediction into two completely independent VQA tasks, rendering the cognition task meaningless. In this paper, we seek to address these issues by proposing an end-to-end…
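To make the differentiability concern concrete, here is a minimal sketch, assuming a joint model that feeds its own predicted answer into the rationale stage. It is not the paper's method: the four training strategies are not described in this summary, and the names `hard_select` and `soft_select`, the scores, and the toy embeddings are all hypothetical. Picking the top-scoring answer with an argmax is a discrete choice, so no gradient reaches the answer scores through it. A probability-weighted average of the answer embeddings is one common differentiable relaxation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hard_select(answer_scores, answer_embeddings):
    """Return the embedding of the top-scoring answer (argmax).

    The index choice is discrete, so in an autodiff framework no
    gradient would flow back to answer_scores through this step.
    """
    best = max(range(len(answer_scores)), key=lambda i: answer_scores[i])
    return answer_embeddings[best]


def soft_select(answer_scores, answer_embeddings):
    """Return the probability-weighted average of the answer embeddings.

    Every operation here is smooth in answer_scores, which is why this
    kind of relaxation is a common workaround (assumed, not the paper's).
    """
    probs = softmax(answer_scores)
    dim = len(answer_embeddings[0])
    return [
        sum(p * emb[d] for p, emb in zip(probs, answer_embeddings))
        for d in range(dim)
    ]


# Hypothetical scores for four candidate answers and toy 2-d embeddings.
scores = [2.0, 0.5, -1.0, 0.1]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

hard = hard_select(scores, embeddings)  # embedding of candidate 0
soft = soft_select(scores, embeddings)  # blend dominated by candidate 0
```

In a joint model, either vector would be passed to the rationale predictor together with the question and image features. Only the soft variant lets the rationale loss influence the answer scores directly.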
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
