Learning to Ground Visual Objects for Visual Dialog
Feilong Chen, Xiuyi Chen, Can Xu, Daxin Jiang

TL;DR
This paper introduces a novel visual object grounding mechanism for visual dialog that uses prior and posterior distributions to improve the accuracy of grounding related objects, enhancing dialog performance.
Contribution
It proposes a new grounding approach employing prior and posterior distributions over visual objects, improving grounding accuracy during training and inference.
Findings
Significant performance improvements on VisDial v0.9 and v1.0 datasets.
Enhanced grounding accuracy in both generative and discriminative models.
Outperforms previous strong models in visual dialog tasks.
Abstract
Visual dialog is challenging because a model must answer a series of coherent questions based on an understanding of the visual environment. How to ground related visual objects is one of the key problems. Previous studies use the question and history to attend to the image and achieve satisfactory performance; however, these methods are not sufficient to locate related visual objects without any guidance. Inappropriate grounding of visual objects limits the performance of visual dialog models. In this paper, we propose a novel approach to Learn to Ground visual objects for visual dialog, which employs a visual object grounding mechanism where both prior and posterior distributions over visual objects are used to facilitate grounding. Specifically, a posterior distribution over visual objects is inferred from both context (history and questions) and answers, and…
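To make the prior/posterior idea concrete, here is a minimal sketch, not the authors' implementation. The abstract is truncated here, so several details are assumptions: that the prior is conditioned on the context only, that each distribution is a softmax over dot-product scores between a query vector and object features, and that a KL term pulls the prior toward the posterior. All names (`object_distribution`, `kl_divergence`, the random features) are hypothetical.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def object_distribution(query, objects):
    # Assumed scoring: dot product between the query and each object feature,
    # normalized into a distribution over the objects.
    return softmax(objects @ query)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for two discrete distributions over the same objects.
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

rng = np.random.default_rng(0)
num_objects, dim = 5, 8
objects = rng.normal(size=(num_objects, dim))  # stand-in object features
context = rng.normal(size=dim)                 # stand-in for history + question
answer = rng.normal(size=dim)                  # stand-in for the answer

# Assumption: the prior sees only the context (available at inference time).
prior = object_distribution(context, objects)
# Per the abstract: the posterior is inferred from both context and answers.
posterior = object_distribution(context + answer, objects)

# Assumption: a KL term of this kind could train the prior to match the posterior.
kl = kl_divergence(posterior, prior)
print(prior, posterior, kl)
```

Under these assumptions, the posterior can use the answer only during training, while the prior must ground objects from the context alone when no answer is available.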
