On the Effectiveness of Integration Methods for Multimodal Dialogue Response Retrieval

Seongbo Jang; Seonghyeon Lee; Dongha Lee; Hwanjo Yu

arXiv:2506.11499·cs.CL·May 5, 2026

TL;DR

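This paper studies how a retrieval-based dialogue system can respond in multiple modalities, such as text and image. It proposes three integration methods built on a two-step approach and an end-to-end approach. In experiments on two datasets, the end-to-end approach performs comparably to the two-step approach, and parameter sharing both reduces the number of parameters and improves performance.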

Contribution

The paper formulates multimodal dialogue response retrieval as a combination of three subtasks and compares three integration methods, showing that the end-to-end approach avoids the intermediate step of the two-step approach and benefits from parameter sharing.

Findings

01

The end-to-end approach achieves performance comparable to the two-step approach without requiring its intermediate step.

02

Parameter sharing reduces the number of parameters and also improves performance.

03

These findings are supported by experiments on two datasets.

Abstract

Multimodal chatbots have become one of the major topics for dialogue systems in both the research community and industry. Recently, researchers have shed light on the multimodality of responses as well as dialogue contexts. This work explores how a dialogue system can output responses in various modalities such as text and image. To this end, we first formulate a multimodal dialogue response retrieval task for retrieval-based systems as the combination of three subtasks. We then propose three integration methods based on a two-step approach and an end-to-end approach, and compare the merits and demerits of each method. Experimental results on two datasets demonstrate that the end-to-end approach achieves comparable performance without an intermediate step in the two-step approach. In addition, a parameter sharing strategy not only reduces the number of parameters but also boosts performance…
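The contrast between the two-step and end-to-end approaches can be illustrated with a toy sketch. The abstract does not spell out the three subtasks or any model details, so everything below is an assumption made for illustration: the intermediate step is taken to be a modality decision, scoring is a plain dot product, and the names `two_step`, `end_to_end`, `modality_weights`, and `pools`, along with all vectors, are hypothetical rather than the authors' implementation.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot-product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def best(context: Vector, pool: Dict[str, Vector]) -> Tuple[str, float]:
    """Return the id and score of the highest-scoring candidate in a pool."""
    return max(((cid, dot(context, vec)) for cid, vec in pool.items()),
               key=lambda pair: pair[1])


def two_step(context: Vector,
             modality_weights: Dict[str, Vector],
             pools: Dict[str, Dict[str, Vector]]) -> Tuple[str, str]:
    # Step 1 (assumed intermediate step): decide the response modality.
    modality = max(modality_weights,
                   key=lambda m: dot(context, modality_weights[m]))
    # Step 2: retrieve only among candidates of the chosen modality.
    cid, _ = best(context, pools[modality])
    return modality, cid


def end_to_end(context: Vector,
               pools: Dict[str, Dict[str, Vector]]) -> Tuple[str, str]:
    # Single ranking over the union of text and image candidates;
    # the modality follows from whichever candidate wins.
    scored = [(m, *best(context, pool)) for m, pool in pools.items()]
    modality, cid, _ = max(scored, key=lambda t: t[2])
    return modality, cid


# Hypothetical toy embeddings; a real system would obtain these from encoders.
context = [1.0, 0.0]
modality_weights = {"text": [0.1, 0.5], "image": [0.7, 0.2]}
pools = {
    "text": {"t1": [0.2, 0.9], "t2": [0.6, 0.1]},
    "image": {"i1": [0.9, 0.3], "i2": [0.1, 0.8]},
}

print(two_step(context, modality_weights, pools))
print(end_to_end(context, pools))
```

In this sketch the two-step variant needs an extra set of modality weights and can only recover from a wrong modality decision by luck, whereas the end-to-end variant requires the text and image candidates to be scored in a comparable space. Parameter sharing is not modeled here.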
