Multimodal Continuous Turn-Taking Prediction Using Multiscale RNNs

Matthew Roddy; Gabriel Skantze; Naomi Harte

arXiv:1808.10785·cs.CL·September 3, 2018

TL;DR

This paper introduces a multiscale RNN architecture that models multiple modalities at different temporal granularities to improve turn-taking prediction in conversational systems, incorporating linguistic, acoustic, and gaze cues.

Contribution

The paper presents a novel multiscale RNN approach that models modalities at separate timescales, enhancing turn-taking prediction in dialogue systems.

Findings

01 Modeling linguistic and acoustic features at separate temporal rates can improve turn-taking prediction.

02 The multiscale approach can also be used to incorporate gaze features into turn-taking models.

03 In the reported experiments, multiscale RNNs compare favorably with single-scale models.

Abstract

In human conversational interactions, turn-taking exchanges can be coordinated using cues from multiple modalities. To design spoken dialog systems that can conduct fluid interactions it is desirable to incorporate cues from separate modalities into turn-taking models. We propose that there is an appropriate temporal granularity at which modalities should be modeled. We design a multiscale RNN architecture to model modalities at separate timescales in a continuous manner. Our results show that modeling linguistic and acoustic features at separate temporal rates can be beneficial for turn-taking modeling. We also show that our approach can be used to incorporate gaze features into turn-taking models.
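The idea of modeling modalities at separate timescales can be illustrated with a small sketch. This is not the authors' implementation (the official repository listed on this page uses PyTorch and LSTMs); it is a minimal pure-Python illustration under stated assumptions: plain tanh recurrent cells stand in for LSTMs, a fast stream (e.g. acoustic frames) updates at every step, a slow stream (e.g. linguistic features) updates once every `ratio` fast steps, and the slow state is held between updates before both are passed to a master cell. The names `make_cell`, `cell_step`, `multiscale_forward`, and `ratio` are hypothetical.

```python
import math
import random


def make_cell(n_in, n_hidden, rng, scale=0.5):
    """Random weights for a plain tanh recurrent cell (a stand-in for an LSTM)."""
    w_in = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_hidden)]
    w_rec = [[rng.uniform(-scale, scale) for _ in range(n_hidden)] for _ in range(n_hidden)]
    return w_in, w_rec


def cell_step(cell, x, h):
    """One recurrent update: h_new = tanh(W_in x + W_rec h)."""
    w_in, w_rec = cell
    return [
        math.tanh(
            sum(w * v for w, v in zip(row_in, x))
            + sum(w * v for w, v in zip(row_rec, h))
        )
        for row_in, row_rec in zip(w_in, w_rec)
    ]


def multiscale_forward(fast_seq, slow_seq, ratio, fast_cell, slow_cell, master_cell):
    """Run a fast and a slow sub-network and fuse them in a master cell.

    Assumption for this sketch: slow frame k becomes available at fast
    step k * ratio, and the slow hidden state is held until the next update.
    """
    needed = (len(fast_seq) + ratio - 1) // ratio
    if len(slow_seq) < needed:
        raise ValueError("slow_seq is too short for the given fast_seq and ratio")

    h_fast = [0.0] * len(fast_cell[0])
    h_slow = [0.0] * len(slow_cell[0])
    h_master = [0.0] * len(master_cell[0])
    states = []
    for t, x_fast in enumerate(fast_seq):
        h_fast = cell_step(fast_cell, x_fast, h_fast)      # updates every step
        if t % ratio == 0:                                  # updates every `ratio` steps
            h_slow = cell_step(slow_cell, slow_seq[t // ratio], h_slow)
        # Fuse both timescales: concatenate the two hidden states.
        h_master = cell_step(master_cell, h_fast + h_slow, h_master)
        states.append(h_master)
    return states


# Toy usage: 10 fast frames (3 features), 2 slow frames (2 features), ratio 5.
rng = random.Random(0)
fast_cell = make_cell(3, 4, rng)
slow_cell = make_cell(2, 4, rng)
master_cell = make_cell(8, 4, rng)  # input size = fast hidden + slow hidden
fast_seq = [[rng.random() for _ in range(3)] for _ in range(10)]
slow_seq = [[rng.random() for _ in range(2)] for _ in range(2)]
states = multiscale_forward(fast_seq, slow_seq, 5, fast_cell, slow_cell, master_cell)
```

In this sketch the master cell emits one state per fast step, which a prediction layer could read continuously. Whether fusion happens at the fast or the slow rate is a design choice the abstract does not specify, so the choice made here is illustrative only.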

Code & Models

Repositories

mattroddy/lstm_turn_taking_prediction
PyTorch · Official
