Learning to Agree on Vision Attention for Visual Commonsense Reasoning

Zhenyang Li; Yangyang Guo; Kejie Wang; Fan Liu; Liqiang Nie; Mohan Kankanhalli

arXiv:2302.02117 · cs.CV · February 21, 2023

Open Access

TL;DR

This paper introduces a novel visual attention alignment method for Visual Commonsense Reasoning that unifies question answering and rationale prediction processes, leading to improved model performance on the VCR benchmark.

Contribution

It proposes a re-attention module to align attention maps between the two processes, enhancing their cooperation in a unified framework.

Findings

01 Significant performance improvement on VCR benchmark.
02 Effective alignment of attention maps improves reasoning accuracy.
03 Applicable to both conventional and Transformer-based models.

Abstract

Visual Commonsense Reasoning (VCR) remains a significant yet challenging research problem in the realm of visual reasoning. A VCR model generally aims at answering a textual question regarding an image, followed by the rationale prediction for the preceding answering process. Though these two processes are sequential and intertwined, existing methods always consider them as two independent matching-based instances. They, therefore, ignore the pivotal relationship between the two processes, leading to sub-optimal model performance. This paper presents a novel visual attention alignment method to efficaciously handle these two processes in a unified framework. To achieve this, we first design a re-attention module for aggregating the vision attention map produced in each process. Thereafter, the resultant two sets of attention maps are carefully aligned to guide the two processes to make…
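The two steps named in the abstract (aggregating the vision attention maps of each process, then aligning the two aggregated results) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names `aggregate_attention`, `symmetric_kl`, and `alignment_loss` are hypothetical, aggregation is done by simple averaging, and alignment is measured with a symmetric KL divergence. The authors' re-attention module and alignment objective may differ.

```python
import math


def aggregate_attention(maps):
    """Pool several attention maps over the same image regions into one distribution.

    `maps` is a list of lists, one inner list of region weights per map.
    Averaging is an assumed stand-in for the paper's re-attention module.
    """
    num_regions = len(maps[0])
    pooled = [sum(m[i] for m in maps) / len(maps) for i in range(num_regions)]
    total = sum(pooled)
    return [w / total for w in pooled]  # renormalize so the weights sum to 1


def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two distributions (assumed alignment measure)."""
    p = [max(x, eps) for x in p]  # avoid log(0)
    q = [max(x, eps) for x in q]
    kl_pq = sum(a * math.log(a / b) for a, b in zip(p, q))
    kl_qp = sum(b * math.log(b / a) for a, b in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)


def alignment_loss(answer_maps, rationale_maps):
    """Penalty that shrinks as the two processes attend to the same regions."""
    return symmetric_kl(
        aggregate_attention(answer_maps),
        aggregate_attention(rationale_maps),
    )


# Toy attention maps over three image regions (made-up numbers).
answer_maps = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
rationale_maps = [[0.1, 0.2, 0.7], [0.2, 0.3, 0.5]]

agreeing = alignment_loss(answer_maps, answer_maps)
disagreeing = alignment_loss(answer_maps, rationale_maps)
```

In this toy setup the penalty is zero when both processes produce the same aggregated map and positive when they focus on different regions, which is the behavior an alignment term added to the training loss would need.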

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization

Methods: Attention Is All You Need · Dense Connections · Linear Layer · Layer Normalization · Multi-Head Attention · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Byte Pair Encoding · Label Smoothing · Adam