Visualizing Dialogues: Enhancing Image Selection through Dialogue Understanding with Large Language Models

Chang-Sheng Kao; Yun-Nung Chen

arXiv:2407.03615 · cs.CL · July 8, 2024

TL;DR

This paper introduces a novel method that uses large language models to generate accurate visual descriptors from dialogues, significantly improving dialogue-to-image retrieval performance and demonstrating broad applicability across datasets and visual cues.

Contribution

We propose leveraging large language models to generate precise visual descriptors from dialogues, overcoming limitations of existing vision-language models in complex dialogue understanding.

Findings

01 Enhanced dialogue-to-image retrieval accuracy

02 Method generalizes across datasets and visual cues

03 Effective use of LLMs for visual descriptor generation

Abstract

Recent advancements in dialogue systems have highlighted the significance of integrating multimodal responses, which enable conveying ideas through diverse modalities rather than solely relying on text-based interactions. This enrichment not only improves overall communicative efficacy but also enhances the quality of conversational experiences. However, existing methods for dialogue-to-image retrieval face limitations due to the constraints of pre-trained vision language models (VLMs) in comprehending complex dialogues accurately. To address this, we present a novel approach leveraging the robust reasoning capabilities of large language models (LLMs) to generate precise dialogue-associated visual descriptors, facilitating seamless connection with images. Extensive experiments conducted on benchmark data validate the effectiveness of our proposed approach in deriving concise and…
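The pipeline the abstract outlines (dialogue, then an LLM-written visual descriptor, then matching against images) can be illustrated with a toy sketch. This is not the authors' implementation: `describe` is a hypothetical placeholder for the LLM step and returns a hard-coded string, `embed` is a bag-of-words stand-in for a real vision-language encoder, and the image entries are made-up text tags rather than actual images.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a real encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def describe(dialogue):
    """Hypothetical placeholder for the LLM step.

    A real system would prompt an LLM with the dialogue turns and ask for a
    short description of the image being shared. The string is hard-coded
    here so the sketch runs without any model or network access.
    """
    return "a dog playing on a beach"


def retrieve(dialogue, images, top_k=1):
    """Rank candidate images by similarity to the dialogue's descriptor."""
    query = embed(describe(dialogue))
    ranked = sorted(
        images,
        key=lambda name: cosine(query, embed(images[name])),
        reverse=True,
    )
    return ranked[:top_k]


# Made-up candidates: each image is represented by an illustrative text tag.
images = {
    "img_001": "a dog running on a sandy beach",
    "img_002": "a plate of pasta on a table",
    "img_003": "a city skyline at night",
}
dialogue = [
    "A: We took Max to the coast yesterday.",
    "B: Did he like the water?",
    "A: Loved it, let me show you.",
]

print(retrieve(dialogue, images))
```

The dialogue never mentions "dog" or "beach", so matching the raw turns against the candidates would carry little signal; the descriptor step is what supplies the visual vocabulary. In a real setting, the text and the images would be embedded by a vision-language model rather than compared as word counts.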

Code & Models

Repositories

MiuLab/VisualDialog (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems · Multimodal Machine Learning Applications