Joint Answering and Explanation for Visual Commonsense Reasoning

Zhenyang Li; Yangyang Guo; Kejie Wang; Yinwei Wei; Liqiang Nie; Mohan Kankanhalli

arXiv:2202.12626 · cs.CV · July 26, 2023 · 1 citation

Open Access · 1 repo

TL;DR

This paper introduces a knowledge distillation framework that couples question answering and rationale inference in visual commonsense reasoning, improving performance by addressing their previous disconnection.

Contribution

It proposes a novel, model-agnostic framework whose bridging branch connects the two VCR processes, question answering and rationale inference, and can be added to existing methods to make them more effective.
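To make the "knowledge distillation" idea concrete, here is a minimal sketch of a generic temperature-scaled distillation loss over a small set of candidate scores. It is not the paper's objective: the function names, the `temperature` and `alpha` parameters, and the four-candidate example are illustrative assumptions, and the bridging branch is not modeled.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution, softened by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Generic distillation loss (hypothetical helper, not the paper's loss).

    Mixes a hard-label cross-entropy term with a soft-label term that pulls
    the student's softened distribution toward the teacher's.
    """
    # Hard-label term: cross-entropy of the student against the true candidate.
    student_probs = softmax(student_logits)
    cross_entropy = -math.log(student_probs[label] + 1e-12)

    # Soft-label term: KL between softened teacher and student distributions,
    # scaled by temperature squared as is common in distillation setups.
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    soft_term = kl_divergence(soft_teacher, soft_student) * temperature ** 2

    return alpha * cross_entropy + (1.0 - alpha) * soft_term


if __name__ == "__main__":
    # Assumed example: scores over four candidate answers, correct index 3.
    teacher = [0.1, 0.4, 1.2, 3.0]
    student = [0.5, 0.5, 1.0, 1.5]
    print(distillation_loss(student, teacher, label=3))
```

With `alpha=0.0` the loss reduces to the soft-label term alone, which is zero when student and teacher scores match. How the paper actually couples the two VCR processes is described in the paper itself, not in this sketch.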

Findings

01. Significant performance improvements on VCR benchmarks.

02. Effective coupling of question answering and rationale inference.

03. Framework is compatible with existing models.

Abstract

Visual Commonsense Reasoning (VCR), regarded as a challenging extension of Visual Question Answering (VQA), pursues higher-level visual comprehension. It is composed of two indispensable processes: question answering over a given image and rationale inference for answer explanation. Over the years, a variety of methods tackling VCR have advanced performance on the benchmark dataset. Significant as these methods are, they often treat the two processes separately and hence decompose VCR into two unrelated VQA instances. As a result, the pivotal connection between question answering and rationale inference is broken, rendering existing efforts less faithful in visual reasoning. To empirically study this issue, we perform some in-depth explorations in terms of both language shortcuts and generalization capability to verify the pitfalls of…

Code & Models

Repositories

sdlzy/arc (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition

Methods: Knowledge Distillation