Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation
Feilong Chen, Fandong Meng, Xiuyi Chen, Peng Li, Jie Zhou

TL;DR
This paper introduces MITVG, a multimodal incremental transformer with visual grounding that explicitly locates the objects in an image referred to by textual entities, with the aim of making generated visual dialogue responses more coherent.
Contribution
The paper proposes a model that pairs a visual grounding component, which explicitly locates objects in the image guided by textual entities, with a multimodal incremental transformer for visual dialogue generation.
Findings
Achieves comparable performance on VisDial datasets.
Effectively locates objects to improve response relevance.
Outperforms previous implicit co-reference methods.
Abstract
Visual dialogue is a challenging task since it needs to answer a series of coherent questions on the basis of understanding the visual environment. Previous studies focus on the implicit exploration of multimodal co-reference by implicitly attending to spatial image features or object-level image features but neglect the importance of locating the objects explicitly in the visual content, which is associated with entities in the textual content. Therefore, in this paper we propose a Multimodal Incremental Transformer with Visual Grounding, named MITVG, which consists of two key parts: visual grounding and multimodal incremental transformer. Visual grounding aims to explicitly locate related objects in the image guided by textual entities, which helps the model exclude the visual content that does not need attention. On the basis of visual grounding, the…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
