Visual Coreference Resolution in Visual Dialog using Neural Module Networks
Satwik Kottur, José M. F. Moura, Devi Parikh, Dhruv Batra, Marcus Rohrbach

TL;DR
This paper introduces a neural module network architecture for visual coreference resolution in visual dialog, explicitly resolving references at the phrase level to improve accuracy and interpretability on complex datasets.
Contribution
It proposes two novel modules, Refer and Exclude, for explicit, grounded coreference resolution in visual dialog, advancing beyond prior implicit or coarse methods.
Findings
Achieves near-perfect accuracy on the MNIST Dialog dataset.
Outperforms existing approaches on the VisDial dataset.
Provides more interpretable and grounded coreference resolution.
Abstract
Visual dialog entails answering a series of questions grounded in an image, using dialog history as context. In addition to the challenges found in visual question answering (VQA), which can be seen as one-round dialog, visual dialog encompasses several more. We focus on one such problem called visual coreference resolution that involves determining which words, typically noun phrases and pronouns, co-refer to the same entity/object instance in an image. This is crucial, especially for pronouns (e.g., 'it'), as the dialog agent must first link it to a previous coreference (e.g., 'boat'), and only then can rely on the visual grounding of the coreference 'boat' to reason about the pronoun 'it'. Prior work (in visual dialog) models visual coreference resolution either (a) implicitly via a memory network over history, or (b) at a coarse level for the entire question; and not explicitly at a…
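The 'it' to 'boat' example in the abstract can be illustrated with a toy sketch. This is not the paper's Refer module. It is a minimal soft lookup, assuming a hypothetical pool of previously grounded phrases, each stored with a made-up key vector and a made-up attention map over four image regions. The function name `refer`, the pool layout, and all numbers are assumptions for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def refer(query, pool):
    """Softly link a query phrase to earlier grounded phrases.

    pool: list of (phrase, key_vector, attention_map) tuples (hypothetical layout).
    Returns the lookup weights and the weighted average of the stored attention maps.
    """
    weights = softmax([dot(query, key) for _, key, _ in pool])
    n_regions = len(pool[0][2])
    resolved = [
        sum(w * att[i] for w, (_, _, att) in zip(weights, pool))
        for i in range(n_regions)
    ]
    return weights, resolved


# Hypothetical reference pool built from earlier dialog rounds (values invented).
pool = [
    ("boat", [4.0, 0.0], [0.9, 0.1, 0.0, 0.0]),
    ("dog", [0.0, 4.0], [0.0, 0.0, 0.2, 0.8]),
]

# Toy embedding for the pronoun 'it' in a context that points toward 'boat'.
query_it = [1.0, 0.0]

weights, attention = refer(query_it, pool)
best = pool[weights.index(max(weights))][0]
print(best, [round(a, 3) for a in attention])
```

In this sketch the pronoun inherits the visual grounding of the phrase it most strongly matches, which mirrors the two-step reasoning the abstract describes: first link the pronoun to an earlier phrase, then reuse that phrase's grounding.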
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
Methods: Memory Network
