Recursive Visual Attention in Visual Dialog

Yulei Niu; Hanwang Zhang; Manli Zhang; Jianhong Zhang; Zhiwu Lu; Ji-Rong Wen

arXiv:1812.02664·cs.CV·April 9, 2019·5 cites

TL;DR

This paper introduces Recursive Visual Attention (RvA), a novel mechanism for visual dialog that iteratively refines visual focus to resolve co-reference, outperforming existing methods on large-scale datasets.

Contribution

The paper proposes RvA, a recursive attention mechanism that improves visual co-reference resolution in dialog systems without extra annotations.

Findings

01. RvA outperforms state-of-the-art methods on the VisDial datasets.
02. RvA provides interpretable attention maps.
03. RvA achieves effective recursion in visual attention.

Abstract

Visual dialog is a challenging vision-language task, which requires the agent to answer multi-round questions about an image. It typically needs to address two major problems: (1) how to answer visually-grounded questions, which is the core challenge in visual question answering (VQA); (2) how to infer the co-reference between questions and the dialog history. An example of visual co-reference is: pronouns (e.g., "they") in the question (e.g., "Are they on or off?") are linked with nouns (e.g., "lamps") appearing in the dialog history (e.g., "How many lamps are there?") and the object grounded in the image. In this work, to resolve the visual co-reference for visual dialog, we propose a novel attention mechanism called Recursive Visual Attention (RvA). Specifically, our dialog agent browses the dialog history until the agent has sufficient confidence in the visual co-reference…
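
The recursion described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract is truncated here, so the stopping rule (a hard-coded pronoun check), the choice of which history round to revisit (always the previous one), the keyword-matching `attend` function, and the `mix` weight are all assumptions made for illustration. In the paper these decisions are presumably learned.

```python
import math
from typing import List, Sequence

# Toy pronoun list used as a stand-in for a learned confidence signal (assumption).
PRONOUNS = {"it", "its", "they", "them", "their", "he", "she"}


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer that strips trailing punctuation."""
    return [word.strip("?.,!").lower() for word in text.split()]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(question: str, regions: Sequence[str]) -> List[float]:
    """Hypothetical question-guided attention: regions whose label
    (or its naive plural) appears in the question get a higher score."""
    tokens = set(tokenize(question))
    scores = [3.0 if (label in tokens or label + "s" in tokens) else 0.0
              for label in regions]
    return softmax(scores)


def is_confident(question: str) -> bool:
    """Hypothetical stopping rule: a question without pronouns is
    treated as unambiguous."""
    return not (set(tokenize(question)) & PRONOUNS)


def recursive_attention(t: int, questions: Sequence[str],
                        regions: Sequence[str], mix: float = 0.5) -> List[float]:
    """Attention over regions for round t, walking back through earlier
    rounds until a question is judged unambiguous or round 0 is reached."""
    current = attend(questions[t], regions)
    if t == 0 or is_confident(questions[t]):
        return current
    # Assumption: always revisit the immediately preceding round.
    inherited = recursive_attention(t - 1, questions, regions, mix)
    # A convex combination of two distributions is still a distribution.
    return [mix * c + (1.0 - mix) * h for c, h in zip(current, inherited)]


regions = ["lamp", "sofa", "window"]
questions = ["How many lamps are there?", "Are they on or off?"]
weights = recursive_attention(1, questions, regions)
print(dict(zip(regions, weights)))
```

In this example the second question contains "they", so the sketch falls back to the first round, where "lamps" is mentioned, and the lamp region ends up with the largest weight.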

Code & Models

Repositories

yuleiniu/rva (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition