Gated Multimodal Fusion with Contrastive Learning for Turn-taking Prediction in Human-robot Dialogue

Jiudong Yang; Peiying Wang; Yi Zhu; Mingchao Feng; Meng Chen; Xiaodong He

arXiv:2204.10172·eess.AS·April 22, 2022

TL;DR

This paper introduces a large-scale multimodal dataset and a novel gated fusion model with contrastive learning to improve turn-taking prediction in human-robot dialogue, addressing data imbalance and modality integration challenges.

Contribution

It presents a new large-scale dataset, a gated multimodal fusion mechanism, and a contrastive learning approach to enhance turn-taking prediction in dialogue systems.

Findings

01. The proposed model outperforms several state-of-the-art baselines.

02. Contrastive learning improves feature representations for turn-taking.

03. Data augmentation effectively addresses class imbalance.

Abstract

Turn-taking, which aims to decide when the next speaker can start talking, is an essential component in building human-robot spoken dialogue systems. Previous studies indicate that multimodal cues can facilitate this challenging task. However, due to the paucity of public multimodal datasets, current methods are mostly limited to either unimodal features or simplistic multimodal ensemble models. In addition, the inherent class imbalance in real scenarios (e.g., a sentence ending with a short pause will mostly be regarded as the end of a turn) also poses a great challenge to the turn-taking decision. In this paper, we first collect a large-scale annotated corpus for turn-taking with over 5,000 real human-robot dialogues in speech and text modalities. Then, a novel gated multimodal fusion mechanism is devised to utilize various information seamlessly for turn-taking prediction. More importantly,…
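The abstract names a gated multimodal fusion mechanism but does not give its formulation here. The following is a minimal sketch of one common form of gated fusion, not the authors' model: a sigmoid gate computed from the concatenated speech and text features decides, per dimension, how much of each modality to keep. The function name `gated_fusion`, the feature dimension, and the randomly initialized weights are all assumptions made for illustration.

```python
import math
import random


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(speech, text, weights, bias):
    """Fuse two equal-length feature vectors with an element-wise gate.

    Hypothetical formulation (an assumption, not taken from the paper):
        g     = sigmoid(W [speech; text] + b)
        fused = g * speech + (1 - g) * text
    """
    concat = speech + text  # list concatenation: [speech; text]
    fused = []
    for i in range(len(speech)):
        z = sum(w * x for w, x in zip(weights[i], concat)) + bias[i]
        g = sigmoid(z)  # gate value for dimension i
        fused.append(g * speech[i] + (1.0 - g) * text[i])
    return fused


# Toy inputs standing in for encoded speech and text features.
random.seed(0)
dim = 4
speech = [random.uniform(-1, 1) for _ in range(dim)]
text = [random.uniform(-1, 1) for _ in range(dim)]
weights = [[random.uniform(-0.5, 0.5) for _ in range(2 * dim)] for _ in range(dim)]
bias = [0.0] * dim

fused = gated_fusion(speech, text, weights, bias)
print(fused)
```

Because each gate value lies strictly between 0 and 1, every fused dimension is a convex combination of the corresponding speech and text values. In a trained model the gate weights would be learned jointly with the classifier rather than sampled at random.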

Taxonomy

Topics: Speech and dialogue systems · Interpreting and Communication in Healthcare · Language, Metaphor, and Cognition

Methods: Contrastive Learning