Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation

Se Jin Park; Chae Won Kim; Hyeongseop Rha; Minsu Kim; Joanna Hong; Jeong Hun Yeo; Yong Man Ro

arXiv:2406.07867 · cs.CV · August 5, 2024

TL;DR

This paper presents a novel face-to-face spoken dialogue model that generates audio-visual responses directly from multimodal input, advancing avatar chatbot technology without intermediate text, supported by a large-scale multimodal dialogue dataset.

Contribution

Introduces MultiDialog, the first large-scale multimodal dialogue corpus, and develops a face-to-face spoken dialogue model integrating large language models with speech-text pretraining.

Findings

01. The proposed model effectively facilitates face-to-face conversations.

02. The MultiDialog dataset enables multimodal synthesis research.

03. The work demonstrates the feasibility of audio-visual dialogue generation.

Abstract

In this paper, we introduce a novel Face-to-Face spoken dialogue model. It processes audio-visual speech from user input and generates audio-visual speech as the response, marking the initial step towards creating an avatar chatbot system without relying on intermediate text. To this end, we newly introduce MultiDialog, the first large-scale multimodal (i.e., audio and visual) spoken dialogue corpus containing 340 hours of approximately 9,000 dialogues, recorded based on the open domain dialogue dataset, TopicalChat. The MultiDialog contains parallel audio-visual recordings of conversation partners acting according to the given script with emotion annotations, which we expect to open up research opportunities in multimodal synthesis. Our Face-to-Face spoken dialogue model incorporates a textually pretrained large language model and adapts it into the audio-visual spoken dialogue domain…

Code & Models

Datasets

IVLLab/MultiDialog
dataset · 878 downloads

Taxonomy

Topics: Speech and dialogue systems · Multi-Agent Systems and Negotiation