Enabling Chatbots with Eyes and Ears: An Immersive Multimodal Conversation System for Dynamic Interactions
Jihyoung Jang, Minwook Bae, Minji Kim, Dilek Hakkani-Tur, Hyounghun Kim

TL;DR
This paper introduces a new multimodal conversation dataset and a novel model enabling chatbots to process visual and auditory inputs for more immersive, dynamic, and long-term multi-party interactions in real-world scenarios.
Contribution
The study presents a new multimodal dataset ($M^3C$) and a multimodal conversation model with memory retrieval, advancing natural, multi-party, multi-session chatbot interactions.
Findings
The model effectively processes visual and auditory inputs.
Human evaluations show strong performance in dynamic conversations.
The results demonstrate the potential of immersive multimodal chatbots.
Abstract
As chatbots continue to evolve toward human-like, real-world interactions, multimodality remains an active area of research and exploration. So far, efforts to integrate multimodality into chatbots have primarily focused on image-centric tasks, such as visual dialogue and image-based instructions, placing emphasis on the "eyes" of human perception while neglecting the "ears", namely auditory aspects. Moreover, these studies often center around static interactions that focus on discussing the modality rather than naturally incorporating it into the conversation, which limits the richness of simultaneous, dynamic engagement. Furthermore, while multimodality has been explored in multi-party and multi-session conversations, task-specific constraints have hindered its seamless integration into dynamic, natural conversations. To address these challenges, this study aims to equip chatbots…
Taxonomy
Topics: AI in Service Interactions · Speech and dialogue systems
