Granular Multimodal Attention Networks for Visual Dialog

Badri N. Patro; Shivansh Patel; Vinay P. Namboodiri

arXiv:1910.05728·cs.CV·October 15, 2019·1 cites

TL;DR

This paper introduces a granular multi-modal attention approach for visual dialog, emphasizing the importance of attention scale and demonstrating improved performance by jointly attending to image and text granules.

Contribution

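It proposes a granular multi-modal attention method that targets the granularity (scale) at which attention is applied in visual dialog. The method improves both image and text attention networks, and the authors report the best performance when image and text granules are attended jointly.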

Findings

01. Improved accuracy in visual dialog tasks.

02. Joint attention on image and text granules yields best performance.

03. Granular attention enhances model interpretability.

Abstract

Vision and language tasks have benefited from attention, and a number of different attention models have been proposed. However, the scale at which attention needs to be applied has not been well examined. In this work, we propose a new method, Granular Multi-modal Attention, which addresses the question of the right granularity at which one needs to attend while solving the Visual Dialog task. The proposed method shows improvement in both image and text attention networks. We then propose a granular Multi-modal Attention network that jointly attends on the image and text granules and shows the best performance. With this work, we observe that obtaining granular attention and doing exhaustive Multi-modal Attention appear to be the best way to attend while solving visual dialog.
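
To make the idea of jointly attending over image and text granules concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' architecture: the abstract does not specify how granules are extracted or scored, so the function name `joint_granular_attention`, the element-wise-product fusion, and the dot-product scoring against a query vector are all hypothetical choices. The sketch only shows the "exhaustive" aspect, meaning every (image granule, text granule) pair receives its own attention weight.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a flat list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def joint_granular_attention(image_granules, text_granules, query):
    """Attend over every (image granule, text granule) pair.

    ASSUMPTION (not from the paper): each pair is fused by an element-wise
    product and scored by its dot product with the query vector.

    Returns (weights, attended), where weights[i][j] is the attention mass
    on pair (i, j) and attended is the weighted sum of fused pair vectors.
    """
    n_v, n_t, dim = len(image_granules), len(text_granules), len(query)

    # Fuse every image granule with every text granule (exhaustive pairing).
    fused = [
        [[v * t for v, t in zip(img, txt)] for txt in text_granules]
        for img in image_granules
    ]

    # Score all pairs, then normalise over the full set of pairs at once.
    flat_scores = [dot(fused[i][j], query) for i in range(n_v) for j in range(n_t)]
    flat_weights = softmax(flat_scores)
    weights = [flat_weights[i * n_t:(i + 1) * n_t] for i in range(n_v)]

    # Weighted sum of fused vectors gives the joint attended representation.
    attended = [0.0] * dim
    for i in range(n_v):
        for j in range(n_t):
            for k in range(dim):
                attended[k] += weights[i][j] * fused[i][j][k]
    return weights, attended


# Toy inputs: 2 image granules, 3 text granules, feature dimension 2.
image_granules = [[1.0, 0.0], [0.0, 1.0]]
text_granules = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]

weights, attended = joint_granular_attention(image_granules, text_granules, query)
```

In this toy setup the query favours the first feature dimension, so the pairs built from the first image granule and a text granule that shares that dimension receive the largest weights. A real model would use learned projections and granules taken from image regions and dialog text rather than hand-written vectors.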

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques