Disambiguating Reference in Visually Grounded Dialogues through Joint Modeling of Textual and Multimodal Semantic Structures
Shun Inadumi, Nobuhiro Ueda, Koichiro Yoshino

TL;DR
This paper introduces a unified framework for resolving references in visually grounded dialogues by jointly modeling textual and multimodal semantic structures, improving pronoun grounding and ambiguity resolution.
Contribution
It proposes a novel joint modeling approach that integrates textual and multimodal reference resolution, enhancing performance in pronoun and elliptical phrase grounding tasks.
Findings
Joint modeling improves reference resolution accuracy.
Incorporating coreference resolution enhances pronoun grounding.
Qualitative analysis shows reduced ambiguity in dialogues.
Abstract
Multimodal reference resolution, including phrase grounding, aims to understand the semantic relations between mentions and real-world objects. Phrase grounding between images and their captions is a well-established task. In contrast, for real-world applications, it is essential to integrate textual and multimodal reference resolution to unravel the reference relations within dialogue, especially in handling ambiguities caused by pronouns and ellipses. This paper presents a framework that unifies textual and multimodal reference resolution by mapping mention embeddings to object embeddings and selecting mentions or objects based on their similarity. Our experiments show that learning textual reference resolution, such as coreference resolution and predicate-argument structure analysis, positively affects performance in multimodal reference resolution. In particular, our model with…
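The abstract describes mapping mention embeddings to object embeddings and selecting a mention or an object by similarity. The following is a minimal sketch of that selection step, not the authors' implementation: the linear projection, the use of cosine similarity, the toy 2-dimensional vectors, and every name (`project`, `cosine`, `resolve`, the candidate labels) are assumptions made for illustration.

```python
import math


def project(vec, matrix):
    """Apply a linear map (given as a list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def resolve(mention, candidates, projection):
    """Return (name, score) of the candidate closest to the projected mention.

    `candidates` mixes textual mentions and visual objects in one pool,
    so one similarity ranking covers both coreference and grounding.
    """
    query = project(mention, projection)
    scored = [(name, cosine(query, emb)) for name, emb in candidates.items()]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical toy data: an identity projection and hand-picked embeddings.
identity = [[1.0, 0.0], [0.0, 1.0]]
candidates = {
    "mention:the red cup": [0.9, 0.1],
    "object:cup_bbox": [1.0, 0.0],
    "object:table_bbox": [0.0, 1.0],
}
pronoun_it = [0.8, 0.2]

best_name, best_score = resolve(pronoun_it, candidates, identity)
print(best_name, round(best_score, 3))
```

With these toy vectors the pronoun is matched to the textual antecedent rather than directly to an object, which illustrates how a single candidate pool lets the same scoring serve both textual and multimodal resolution. In a trained model the projection and embeddings would be learned rather than fixed by hand.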
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
Methods: MDETR
