Learning to Ground Visual Objects for Visual Dialog

Feilong Chen; Xiuyi Chen; Can Xu; Daxin Jiang

arXiv:2109.06013·cs.CV·June 1, 2022

TL;DR

This paper introduces a novel visual object grounding mechanism for visual dialog that uses prior and posterior distributions to improve the accuracy of grounding related objects, enhancing dialog performance.

Contribution

It proposes a new grounding approach employing prior and posterior distributions over visual objects, improving grounding accuracy during training and inference.

Findings

01. Significant performance improvements on the VisDial v0.9 and v1.0 datasets.

02. Enhanced grounding accuracy in both generative and discriminative models.

03. Outperforms previous strong models in visual dialog tasks.

Abstract

Visual dialog is challenging because a model must answer a series of coherent questions based on an understanding of the visual environment. How to ground related visual objects is one of the key problems. Previous studies use the question and history to attend to the image and achieve satisfactory performance; however, these methods are not sufficient to locate related visual objects without any guidance. Inappropriate grounding of visual objects hinders the performance of visual dialog models. In this paper, we propose a novel approach to Learn to Ground visual objects for visual dialog, which employs a visual object grounding mechanism in which both prior and posterior distributions over visual objects are used to facilitate grounding. Specifically, a posterior distribution over visual objects is inferred from both context (history and questions) and answers, and…
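The prior and posterior distributions described in the abstract can be illustrated with a toy sketch. Because the abstract is truncated here, everything beyond "a posterior over objects inferred from context and answers" is an assumption made for illustration: the dot-product scoring, the summing of context and answer vectors to form the posterior query, the prior conditioned on context alone, and the KL divergence used to compare the two distributions are all hypothetical choices, not the authors' stated method.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def object_distribution(query, object_feats):
    """Distribution over objects from dot-product scores (assumed scoring rule)."""
    return softmax([dot(query, feat) for feat in object_feats])


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of the same length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Toy inputs (hypothetical): three detected objects with 2-d features.
object_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context_vec = [0.5, 0.2]   # stands in for encoded history + question
answer_vec = [0.0, 1.5]    # stands in for the encoded answer

# Prior: conditioned on context only (assumption; usable when no answer is known).
prior = object_distribution(context_vec, object_feats)

# Posterior: conditioned on context and answer (combined by summation, an assumption).
posterior_query = [c + a for c, a in zip(context_vec, answer_vec)]
posterior = object_distribution(posterior_query, object_feats)

# Gap between the two distributions (KL is an assumed choice of measure).
gap = kl_divergence(posterior, prior)
```

In this sketch the posterior sees the answer and so shifts weight toward objects that match it, while the prior has only the context to work with; `gap` measures how far apart the two are.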
