Learning to Agree on Vision Attention for Visual Commonsense Reasoning
Zhenyang Li, Yangyang Guo, Kejie Wang, Fan Liu, Liqiang Nie, Mohan Kankanhalli

TL;DR
This paper introduces a novel visual attention alignment method for Visual Commonsense Reasoning that unifies question answering and rationale prediction processes, leading to improved model performance on the VCR benchmark.
Contribution
It proposes a re-attention module to align attention maps between the two processes, enhancing their cooperation in a unified framework.
Findings
Significant performance improvement on VCR benchmark.
Effective alignment of attention maps improves reasoning accuracy.
Applicable to both conventional and Transformer-based models.
Abstract
Visual Commonsense Reasoning (VCR) remains a significant yet challenging research problem in the realm of visual reasoning. A VCR model generally aims at answering a textual question regarding an image, followed by the rationale prediction for the preceding answering process. Though these two processes are sequential and intertwined, existing methods always consider them as two independent matching-based instances. They, therefore, ignore the pivotal relationship between the two processes, leading to sub-optimal model performance. This paper presents a novel visual attention alignment method to efficaciously handle these two processes in a unified framework. To achieve this, we first design a re-attention module for aggregating the vision attention map produced in each process. Thereafter, the resultant two sets of attention maps are carefully aligned to guide the two processes to make…
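The two steps named in the abstract (aggregating the attention maps from each process, then aligning the two results) can be illustrated with a small sketch. This is not the authors' implementation: the averaging used for aggregation, the symmetric KL divergence used as the alignment penalty, and the names `re_attend` and `alignment_loss` are all assumptions chosen for illustration, since the text above does not specify them.

```python
import math


def normalize(weights):
    """Rescale non-negative attention weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    return [w / total for w in weights]


def re_attend(attention_maps):
    """Aggregate several attention maps over the same image regions.

    Assumption: a plain average followed by renormalization stands in
    for the paper's re-attention module, whose details are not given here.
    """
    num_maps = len(attention_maps)
    num_regions = len(attention_maps[0])
    averaged = [
        sum(m[i] for m in attention_maps) / num_maps
        for i in range(num_regions)
    ]
    return normalize(averaged)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def alignment_loss(answer_attn, rationale_attn):
    """Symmetric KL penalty between the two aggregated attention maps.

    Assumption: the paper's actual alignment objective may differ.
    """
    return 0.5 * (
        kl_divergence(answer_attn, rationale_attn)
        + kl_divergence(rationale_attn, answer_attn)
    )


if __name__ == "__main__":
    # Hypothetical attention over three image regions from each process.
    answer = re_attend([[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]])
    rationale = re_attend([[0.2, 0.3, 0.5], [0.1, 0.2, 0.7]])
    print("alignment loss:", alignment_loss(answer, rationale))
```

In a sketch like this, the penalty would be added to the answering and rationale-prediction losses during training, so that both processes are encouraged to attend to the same image regions.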
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
Methods: Attention Is All You Need · Dense Connections · Linear Layer · Layer Normalization · Multi-Head Attention · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Byte Pair Encoding · Label Smoothing · Adam
