Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation
Se Jin Park, Chae Won Kim, Hyeongseop Rha, Minsu Kim, Joanna Hong, Jeong Hun Yeo, Yong Man Ro

TL;DR
This paper presents a novel face-to-face spoken dialogue model that generates audio-visual responses directly from multimodal input, advancing avatar chatbot technology without intermediate text, supported by a large-scale multimodal dialogue dataset.
Contribution
Introduces MultiDialog, the first large-scale multimodal dialogue corpus, and develops a face-to-face spoken dialogue model integrating large language models with speech-text pretraining.
Findings
Model effectively facilitates face-to-face conversations.
MultiDialog dataset enables multimodal synthesis research.
Demonstrates the feasibility of audio-visual dialogue generation.
Abstract
In this paper, we introduce a novel Face-to-Face spoken dialogue model. It processes audio-visual speech from user input and generates audio-visual speech as the response, marking the initial step towards creating an avatar chatbot system without relying on intermediate text. To this end, we newly introduce MultiDialog, the first large-scale multimodal (i.e., audio and visual) spoken dialogue corpus containing 340 hours of approximately 9,000 dialogues, recorded based on the open domain dialogue dataset, TopicalChat. The MultiDialog contains parallel audio-visual recordings of conversation partners acting according to the given script with emotion annotations, which we expect to open up research opportunities in multimodal synthesis. Our Face-to-Face spoken dialogue model incorporates a textually pretrained large language model and adapts it into the audio-visual spoken dialogue domain…
Taxonomy
Topics: Speech and dialogue systems · Multi-Agent Systems and Negotiation
