Enabling Chatbots with Eyes and Ears: An Immersive Multimodal Conversation System for Dynamic Interactions

Jihyoung Jang; Minwook Bae; Minji Kim; Dilek Hakkani-Tur; Hyounghun Kim

arXiv:2506.00421·cs.CL·June 3, 2025

Open Access · 2 Models · 1 Dataset · 1 Video

TL;DR

This paper introduces a new multimodal conversation dataset and a novel model enabling chatbots to process visual and auditory inputs for more immersive, dynamic, and long-term multi-party interactions in real-world scenarios.

Contribution

The study presents a new multimodal dataset ($M^3C$) and a multimodal conversation model with memory retrieval, advancing natural, multi-party, multi-session chatbot interactions.

Findings

01 The model effectively processes visual and auditory inputs.

02 Human evaluations show strong performance in dynamic conversations.

03 The work demonstrates the potential of immersive multimodal chatbots.

Abstract

As chatbots continue to evolve toward human-like, real-world interactions, multimodality remains an active area of research and exploration. So far, efforts to integrate multimodality into chatbots have primarily focused on image-centric tasks, such as visual dialogue and image-based instructions, placing emphasis on the "eyes" of human perception while neglecting the "ears", namely auditory aspects. Moreover, these studies often center on static interactions that focus on discussing the modality rather than naturally incorporating it into the conversation, which limits the richness of simultaneous, dynamic engagement. Furthermore, while multimodality has been explored in multi-party and multi-session conversations, task-specific constraints have hindered its seamless integration into dynamic, natural conversations. To address these challenges, this study aims to equip chatbots…

Code & Models

Models

Datasets

jihyoung/M3C
dataset · 35 downloads

Videos

Enabling Chatbots with Eyes and Ears: An Immersive Multimodal Conversation System for Dynamic Interactions

Taxonomy

Topics: AI in Service Interactions · Speech and dialogue systems