Disambiguating Reference in Visually Grounded Dialogues through Joint Modeling of Textual and Multimodal Semantic Structures

Shun Inadumi; Nobuhiro Ueda; Koichiro Yoshino

arXiv:2505.11726 · cs.CL · June 3, 2025

TL;DR

This paper introduces a unified framework for resolving references in visually grounded dialogues by jointly modeling textual and multimodal semantic structures, improving pronoun grounding and ambiguity resolution.

Contribution

It proposes a novel joint modeling approach that integrates textual and multimodal reference resolution, enhancing performance in pronoun and elliptical phrase grounding tasks.

Findings

01. Joint modeling improves reference resolution accuracy.
02. Incorporating coreference resolution enhances pronoun grounding.
03. Qualitative analysis shows reduced ambiguity in dialogues.

Abstract

Multimodal reference resolution, including phrase grounding, aims to understand the semantic relations between mentions and real-world objects. Phrase grounding between images and their captions is a well-established task. In contrast, for real-world applications, it is essential to integrate textual and multimodal reference resolution to unravel the reference relations within dialogue, especially in handling ambiguities caused by pronouns and ellipses. This paper presents a framework that unifies textual and multimodal reference resolution by mapping mention embeddings to object embeddings and selecting mentions or objects based on their similarity. Our experiments show that learning textual reference resolution, such as coreference resolution and predicate-argument structure analysis, positively affects performance in multimodal reference resolution. In particular, our model with…
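
The selection step described in the abstract (comparing a mention embedding against candidate mention and object embeddings and picking the most similar one) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy 3-dimensional vectors, the candidate labels, and the choice of cosine similarity are all assumptions made for illustration, since the abstract says only that selection is "based on their similarity".

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def resolve_reference(mention_vec, candidates):
    """Return the label of the candidate most similar to the mention.

    `candidates` maps a hypothetical label (an earlier textual mention or
    an object region) to its embedding in the shared space.
    """
    scores = {label: cosine(mention_vec, vec) for label, vec in candidates.items()}
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy embeddings, invented for illustration only.
pronoun_it = [0.9, 0.1, 0.0]  # embedding of the pronoun "it"
candidates = {
    "mention:the cup": [1.0, 0.0, 0.0],  # earlier textual mention
    "object:box_1": [0.0, 1.0, 0.0],     # detected image region
    "object:box_2": [0.0, 0.0, 1.0],     # another detected image region
}

best, scores = resolve_reference(pronoun_it, candidates)
print(best)
```

Because textual mentions and image objects are scored in the same space, a single argmax can answer either a coreference question (which earlier mention?) or a grounding question (which object?), which is the intuition behind treating the two tasks jointly.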

Code & Models

Repositories

sinadumi/mmrr (PyTorch, official)

Videos

Disambiguating Reference in Visually Grounded Dialogues through Joint Modeling of Textual and Multimodal Semantic Structures · underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques

Methods: MDETR