Granular Multimodal Attention Networks for Visual Dialog
Badri N. Patro, Shivansh Patel, Vinay P. Namboodiri

TL;DR
This paper introduces a granular multi-modal attention approach for visual dialog, emphasizing the importance of attention scale and demonstrating improved performance by jointly attending to image and text granules.
Contribution
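It proposes Granular Multi-modal Attention, a method that addresses the granularity (scale) at which attention should be applied in the Visual Dialog task. The abstract reports improvements for both image and text attention networks, with the jointly attending network performing best.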
Findings
Granular attention improves both the image attention network and the text attention network on the Visual Dialog task.
Joint attention over image and text granules yields the best performance.
Granular attention enhances model interpretability.
Abstract
Vision and language tasks have benefited from attention, and a number of different attention models have been proposed. However, the scale at which attention needs to be applied has not been well examined. In this work, we propose a new method, Granular Multi-modal Attention, which addresses the question of the right granularity at which one needs to attend while solving the Visual Dialog task. The proposed method shows improvement in both image and text attention networks. We then propose a Granular Multi-modal Attention network that jointly attends on the image and text granules and shows the best performance. With this work, we observe that obtaining granular attention and performing exhaustive multi-modal attention appears to be the best way to attend while solving visual dialog.
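Illustrative sketch
The abstract does not spell out the architecture, so the following is only a minimal sketch of what "jointly attending on image and text granules" could look like, not the authors' implementation. It assumes that granules are fixed-length feature vectors (for example image regions and words or sentences), that pairs are scored with a plain dot product, and that "exhaustive" means one softmax over every image-text pair. The function name `joint_granular_attention` and all inputs are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def joint_granular_attention(image_granules, text_granules):
    """Attend jointly over all (image granule, text granule) pairs.

    Assumption (not from the paper): both granule sets share one feature
    dimension and pair affinity is an unparameterized dot product.
    """
    n_img, n_txt = len(image_granules), len(text_granules)
    dim = len(image_granules[0])

    # Score every image-text pair, stored row-major: index i * n_txt + j.
    scores = [dot(v, t) for v in image_granules for t in text_granules]

    # A single softmax over all pairs gives one joint attention distribution.
    joint = softmax(scores)

    # Marginalize the joint distribution to get per-modality weights.
    img_w = [sum(joint[i * n_txt + j] for j in range(n_txt)) for i in range(n_img)]
    txt_w = [sum(joint[i * n_txt + j] for i in range(n_img)) for j in range(n_txt)]

    # Attended context vectors: weighted sums of the granule features.
    img_ctx = [sum(w * g[d] for w, g in zip(img_w, image_granules)) for d in range(dim)]
    txt_ctx = [sum(w * g[d] for w, g in zip(txt_w, text_granules)) for d in range(dim)]
    return img_w, txt_w, img_ctx, txt_ctx


# Toy inputs: three image granules and two text granules in a 2-D feature space.
image_granules = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_granules = [[2.0, 0.0], [0.0, 0.5]]
img_w, txt_w, img_ctx, txt_ctx = joint_granular_attention(image_granules, text_granules)
```

Under these assumptions, a granule's weight in one modality depends on how well it matches every granule in the other modality, which is one plausible reading of joint attention. A learned model would replace the dot product with trained projections.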
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
