Resolving References in Visually-Grounded Dialogue via Text Generation

Bram Willemsen; Livia Qian; Gabriel Skantze

arXiv:2309.13430·cs.CL·September 26, 2023

TL;DR

This paper pairs a fine-tuned causal large language model (LLM) with a pretrained vision-language model (VLM) to resolve references in visually-grounded dialogue. On average, the approach outperforms the baselines it is compared against.

Contribution

The authors propose fine-tuning a causal LLM to generate definite descriptions that summarize the coreferential information found in the linguistic context of a reference. A pretrained VLM then identifies the referent from the generated description, zero-shot.

Findings

01. The method outperforms the baselines, on average, on a manually annotated dataset of visually-grounded dialogues.

02. Using larger context windows for descriptions improves performance.

03. Zero-shot referent identification becomes more effective with generated descriptions.

Abstract

Vision-language models (VLMs) have been shown to be effective at image retrieval based on simple text queries, but text-image retrieval based on conversational input remains a challenge. Consequently, if we want to use VLMs for reference resolution in visually-grounded dialogue, the discourse processing capabilities of these models need to be augmented. To address this issue, we propose fine-tuning a causal large language model (LLM) to generate definite descriptions that summarize coreferential information found in the linguistic context of references. We then use a pretrained VLM to identify referents based on the generated descriptions, zero-shot. We evaluate our approach on a manually annotated dataset of visually-grounded dialogues and achieve results that, on average, exceed the performance of the baselines we compare against. Furthermore, we find that using referent descriptions based…
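
The zero-shot identification step described in the abstract can be illustrated with a minimal sketch. It assumes the generated description and each candidate image have already been embedded into a shared vector space by a pretrained VLM; the toy vectors, the image IDs, and the function names `cosine` and `identify_referent` are hypothetical and are not taken from the paper or its repository. The sketch picks the candidate whose embedding has the highest cosine similarity to the description embedding, which is one common way to do text-image retrieval and may differ from the authors' exact scoring.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def identify_referent(description_embedding, image_embeddings):
    """Return the best-matching image ID and the score of every candidate.

    description_embedding: embedding of the generated definite description
        (assumed to come from the text encoder of a pretrained VLM).
    image_embeddings: dict mapping image ID to its embedding
        (assumed to come from the image encoder of the same VLM).
    """
    scores = {
        image_id: cosine(description_embedding, embedding)
        for image_id, embedding in image_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy, hand-written embeddings (hypothetical values for illustration only).
description = [0.9, 0.1, 0.0]
candidates = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.0, 1.0, 0.0],
    "img_c": [0.0, 0.0, 1.0],
}

best, scores = identify_referent(description, candidates)
print(best)
```

In a real setup, the embeddings would be produced by the VLM's text and image encoders rather than written by hand, and the description would be the output of the fine-tuned LLM for a given mention.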

Code & Models

Repositories

willemsenbram/reference-resolution-via-text-generation (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques