Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models

Bram Willemsen; Gabriel Skantze

arXiv:2506.21294·cs.CL·June 27, 2025

TL;DR

This paper investigates using autoregressive language models trained on text alone to detect referring expressions in visually grounded dialogue, emphasizing the importance of linguistic context in a multimodal task.

Contribution

The paper demonstrates that a text-only, autoregressive language model can effectively identify referring expressions in visually grounded dialogue, highlighting how much the linguistic context alone contributes.

Findings

01 Text-only models can detect mentions effectively

02 Small datasets and moderate-sized models suffice

03 Linguistic context plays a crucial role

Abstract

In this paper, we explore the use of a text-only, autoregressive language modeling approach for the extraction of referring expressions from visually grounded dialogue. More specifically, the aim is to investigate the extent to which the linguistic context alone can inform the detection of mentions that have a (visually perceivable) referent in the visual context of the conversation. To this end, we adapt a pretrained large language model (LLM) to perform a relatively coarse-grained annotation of mention spans in unfolding conversations by demarcating mention span boundaries in text via next-token prediction. Our findings indicate that even when using a moderately sized LLM, relatively small datasets, and parameter-efficient fine-tuning, a text-only approach can be effective, highlighting the relative importance of the linguistic context for this task. Nevertheless, we argue that the…
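To make the idea of demarcating mention span boundaries in text more concrete, here is a minimal sketch of the post-processing such an approach might need: reading a marked-up utterance and recovering the mention spans as character offsets. The marker strings `<m>` and `</m>`, the function name `extract_mentions`, and the example utterance are all assumptions made for illustration; the abstract does not specify the markup used, and this is not the authors' implementation. Nested mentions are not handled.

```python
import re

# Hypothetical boundary markers (assumption; the actual markup is not given in the abstract).
OPEN, CLOSE = "<m>", "</m>"

def extract_mentions(marked: str):
    """Strip boundary markers from `marked` and return (plain_text, spans).

    Each span is a (start, end) pair of character offsets into plain_text.
    A closing marker without a preceding opening marker is ignored, as is an
    opening marker that is never closed.
    """
    pattern = f"({re.escape(OPEN)}|{re.escape(CLOSE)})"
    plain_parts = []
    spans = []
    pos = 0        # length of the plain text emitted so far
    start = None   # offset of the currently open mention, if any
    for piece in re.split(pattern, marked):
        if piece == OPEN:
            start = pos
        elif piece == CLOSE:
            if start is not None:
                spans.append((start, pos))
                start = None
        else:
            plain_parts.append(piece)
            pos += len(piece)
    return "".join(plain_parts), spans

# Illustrative utterance, invented for this sketch.
text, spans = extract_mentions("I see <m>the red chair</m> next to <m>a lamp</m>.")
mentions = [text[s:e] for s, e in spans]
```

Returning offsets into the unmarked text keeps the predicted spans comparable with gold annotations that are stored against the original utterance.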

Code & Models

Repositories

willemsenbram/mention-detection-vgd
Official

Videos

Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models · Underline

Taxonomy

Topics: Speech and dialogue systems · Multimodal Machine Learning Applications · Subtitles and Audiovisual Media