Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation

Feilong Chen; Fandong Meng; Xiuyi Chen; Peng Li; Jie Zhou

arXiv:2109.08478 · cs.CL · September 20, 2021 · 1 citation

TL;DR

This paper introduces MITVG, a multimodal incremental transformer with visual grounding that explicitly locates objects in images to improve visual dialogue generation and response coherence.

Contribution

The paper proposes a novel model combining visual grounding with an incremental transformer to explicitly locate objects, improving visual dialogue generation.

Findings

01 Achieves comparable performance on VisDial datasets.

02 Effectively locates objects to improve response relevance.

03 Outperforms previous implicit co-reference methods.

Abstract

Visual dialogue is a challenging task since it needs to answer a series of coherent questions on the basis of understanding the visual environment. Previous studies focus on the implicit exploration of multimodal co-reference by implicitly attending to spatial image features or object-level image features but neglect the importance of locating the objects explicitly in the visual content, which is associated with entities in the textual content. Therefore, in this paper we propose a Multimodal Incremental Transformer with Visual Grounding, named MITVG, which consists of two key parts: visual grounding and multimodal incremental transformer. Visual grounding aims to explicitly locate related objects in the image guided by textual entities, which helps the model exclude the visual content that does not need attention. On the basis of visual grounding, the…
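The grounding idea described in the abstract (locate the objects that textual entities refer to, then ignore the rest of the image) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the IoU threshold of 0.5, and the use of box overlap to decide which region features stay visible are all assumptions made for illustration.

```python
import math

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounded_mask(region_boxes, grounded_boxes, threshold=0.5):
    """Mark each image region that overlaps at least one grounded box.

    `grounded_boxes` stands in for the output of a visual grounding model
    queried with textual entities (hypothetical input, assumed here).
    """
    return [any(iou(r, g) >= threshold for g in grounded_boxes)
            for r in region_boxes]

def masked_attention(scores, mask):
    """Softmax over scores, with masked-out regions receiving zero weight."""
    kept = [s for s, m in zip(scores, mask) if m]
    if not kept:
        return [0.0] * len(scores)
    top = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]

# Three candidate regions; the entity was grounded to the top-left box.
region_boxes = [(0, 0, 10, 10), (50, 50, 80, 80), (1, 1, 11, 11)]
grounded_boxes = [(0, 0, 10, 10)]
mask = grounded_mask(region_boxes, grounded_boxes)
weights = masked_attention([2.0, 5.0, 1.0], mask)
```

In this toy example the second region has the highest raw score but does not overlap the grounded box, so it receives no attention weight; the remaining weight is shared between the two overlapping regions.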

Code & Models

Repositories

zyang-ur/onestage_grounding
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning