Joint Answering and Explanation for Visual Commonsense Reasoning
Zhenyang Li, Yangyang Guo, Kejie Wang, Yinwei Wei, Liqiang Nie, Mohan Kankanhalli

TL;DR
This paper introduces a knowledge distillation framework that couples question answering and rationale inference in visual commonsense reasoning, two processes that earlier methods handled separately, and reports improved performance as a result.
Contribution
The paper proposes a model-agnostic framework with a bridging branch that connects the question answering and rationale inference processes of VCR, and that can be applied on top of existing methods to improve their effectiveness.
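To give a rough sense of what a knowledge distillation objective looks like, here is a minimal sketch of the generic soft-target loss, in which a student distribution is pulled toward a teacher distribution over candidate choices. It is not the authors' formulation: the function names, the temperature value, and the four-way logits are all assumptions made for illustration, and the summary above does not say which branch acts as teacher or how the bridging branch is wired in.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic soft-target distillation loss (hypothetical helper).

    Both sets of logits are softened with the same temperature, and the
    student is penalised by its KL divergence from the teacher. The T^2
    factor is the usual rescaling that keeps gradient magnitudes comparable
    across temperatures.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, p_student)


# Illustrative logits over four candidate choices (made-up numbers).
teacher = [2.0, 0.5, 0.1, -1.0]
student = [0.3, 1.8, 0.0, -0.5]
loss = distillation_loss(student, teacher)
```

The loss is zero when the student matches the teacher exactly and grows as the two distributions diverge, which is the property a coupling term between two prediction branches would rely on.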
Findings
Significant performance improvements on VCR benchmarks.
Effective coupling of question answering and rationale inference.
Framework is compatible with existing models.
Abstract
Visual Commonsense Reasoning (VCR), regarded as a challenging extension of Visual Question Answering (VQA), aims at a higher level of visual comprehension. It is composed of two indispensable processes: question answering over a given image and rationale inference for answer explanation. Over the years, a variety of methods tackling VCR have advanced the performance on the benchmark dataset. Significant as these methods are, they often treat the two processes separately and hence decompose VCR into two unrelated VQA instances. As a result, the pivotal connection between question answering and rationale inference is interrupted, rendering existing efforts less faithful on visual reasoning. To empirically study this issue, we perform some in-depth explorations in terms of both language shortcuts and generalization capability to verify the pitfalls of…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
Methods: Knowledge Distillation
