Extending Phrase Grounding with Pronouns in Visual Dialogues

Panzhong Lu; Xin Zhang; Meishan Zhang; Min Zhang

arXiv:2210.12658 · cs.CL · October 25, 2022

TL;DR

This paper extends phrase grounding in visual dialogues to include pronouns, introduces a new dataset, and demonstrates that coreference-aware models improve grounding accuracy for both nouns and pronouns.

Contribution

The paper builds a dataset for grounding both pronouns and noun phrases, and proposes a coreference-enhanced model that uses graph convolutional networks to improve cross-modal grounding.

Findings

1. Pronouns are easier to ground than noun phrases.
2. Coreference information significantly improves grounding performance.
3. The proposed model outperforms baseline methods.

Abstract

Conventional phrase grounding localizes noun phrases mentioned in a caption to their corresponding image regions, and it has achieved great success recently. However, grounding noun phrases alone is not enough for cross-modal visual language understanding. Here we extend the task to cover pronouns as well. First, we construct a phrase grounding dataset that links both noun phrases and pronouns to image regions. On this dataset, we evaluate a state-of-the-art phrase grounding model from the literature as a baseline. Then, we enhance the baseline with coreference information, which should help the task, modeling the coreference structures with graph convolutional networks. Experiments on our dataset, interestingly, show that pronouns are easier to ground than noun phrases, where the possible reason might be that these pronouns…
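
To make the idea of modeling coreference structures with a graph convolution concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names `coref_adjacency` and `gcn_layer`, the toy features, and the identity weight matrix are all hypothetical, and it uses row-normalized mean aggregation with self-loops, a common simplification of graph convolution, rather than whatever normalization the paper's model applies.

```python
from typing import List

Matrix = List[List[float]]


def coref_adjacency(n_mentions: int, clusters: List[List[int]]) -> Matrix:
    """Build a row-normalized adjacency matrix with self-loops.

    Mentions that share a coreference cluster are connected to each other.
    (Hypothetical helper for illustration only.)
    """
    adj = [[1.0 if i == j else 0.0 for j in range(n_mentions)]
           for i in range(n_mentions)]
    for cluster in clusters:
        for i in cluster:
            for j in cluster:
                adj[i][j] = 1.0
    # Normalize each row so aggregation is a mean over linked mentions.
    for row in adj:
        total = sum(row)
        for j in range(n_mentions):
            row[j] /= total
    return adj


def gcn_layer(adj: Matrix, feats: Matrix, weight: Matrix) -> Matrix:
    """One graph-convolution step: ReLU(adj @ feats @ weight)."""
    n, d_in, d_out = len(feats), len(weight), len(weight[0])
    # Aggregate features from coreferent mentions.
    agg = [[sum(adj[i][k] * feats[k][c] for k in range(n))
            for c in range(d_in)] for i in range(n)]
    # Apply the linear map and the ReLU non-linearity.
    return [[max(0.0, sum(agg[i][c] * weight[c][o] for c in range(d_in)))
             for o in range(d_out)] for i in range(n)]


# Toy example: mentions 0 ("a man") and 1 ("he") corefer; mention 2 ("a dog") is alone.
adj = coref_adjacency(3, [[0, 1]])
feats = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]   # assumed toy mention features
identity = [[1.0, 0.0], [0.0, 1.0]]            # assumed weight, for readability
updated = gcn_layer(adj, feats, identity)
print(updated[1])  # [0.5, 0.0]: the pronoun picks up features from its antecedent
```

In this toy setup the pronoun starts with an all-zero feature vector and, after one propagation step, inherits half of its antecedent's features, which is the intuition behind feeding coreference links into a grounding model.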

Code & Models

Repositories

izhx/phrase-grounding-with-pronoun (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Video Analysis and Summarization
