TL;DR
This paper introduces a multiscale RNN architecture that models multiple modalities at different temporal granularities to improve turn-taking prediction in conversational systems, incorporating linguistic, acoustic, and gaze cues.
Contribution
The paper presents a novel multiscale RNN approach that models modalities at separate timescales, enhancing turn-taking prediction in dialogue systems.
Findings
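Modeling linguistic and acoustic features at separate temporal rates can be beneficial for turn-taking prediction.
The same multiscale approach can be used to incorporate gaze features into turn-taking models.
The multiscale RNN processes each modality continuously at its own timescale rather than forcing all modalities onto a single rate.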
Abstract
In human conversational interactions, turn-taking exchanges can be coordinated using cues from multiple modalities. To design spoken dialog systems that can conduct fluid interactions, it is desirable to incorporate cues from separate modalities into turn-taking models. We propose that there is an appropriate temporal granularity at which modalities should be modeled. We design a multiscale RNN architecture to model modalities at separate timescales in a continuous manner. Our results show that modeling linguistic and acoustic features at separate temporal rates can be beneficial for turn-taking modeling. We also show that our approach can be used to incorporate gaze features into turn-taking models.
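The idea of modeling modalities at separate timescales can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the plain tanh cells, the fixed integer `ratio` between the fast (acoustic) and slow (linguistic) streams, the fusion by concatenation, and all function and parameter names are assumptions made for illustration only.

```python
import numpy as np

def init_cell(rng, n_in, n_hidden):
    # Hypothetical helper: weights for one plain tanh RNN cell.
    return (rng.normal(0.0, 0.1, (n_in, n_hidden)),
            rng.normal(0.0, 0.1, (n_hidden, n_hidden)),
            np.zeros(n_hidden))

def rnn_step(x, h, w_x, w_h, b):
    # One recurrent update: new hidden state from input x and previous state h.
    return np.tanh(x @ w_x + h @ w_h + b)

def multiscale_forward(acoustic, linguistic, ratio, fast_cell, slow_cell, w_out, b_out):
    """Run two RNNs at different rates and fuse them at the slow rate.

    acoustic:   (T_slow * ratio, d_a) array, fast-rate features (assumed)
    linguistic: (T_slow, d_l) array, slow-rate features (assumed)
    Returns one probability per slow step.
    """
    h_fast = np.zeros(fast_cell[1].shape[0])
    h_slow = np.zeros(slow_cell[1].shape[0])
    probs = []
    for t in range(linguistic.shape[0]):
        # The fast stream advances `ratio` steps for every slow step.
        for k in range(ratio):
            h_fast = rnn_step(acoustic[t * ratio + k], h_fast, *fast_cell)
        # The slow stream advances once.
        h_slow = rnn_step(linguistic[t], h_slow, *slow_cell)
        # Fuse the latest state of each stream and map it to a probability.
        fused = np.concatenate([h_fast, h_slow])
        logit = fused @ w_out + b_out
        probs.append(1.0 / (1.0 + np.exp(-logit)))
    return np.array(probs)

# Example with random inputs: 4 slow steps, 5 fast frames per slow step.
rng = np.random.default_rng(0)
fast_cell = init_cell(rng, n_in=3, n_hidden=8)
slow_cell = init_cell(rng, n_in=2, n_hidden=6)
w_out = rng.normal(0.0, 0.1, 8 + 6)
acoustic = rng.normal(size=(4 * 5, 3))
linguistic = rng.normal(size=(4, 2))
probs = multiscale_forward(acoustic, linguistic, 5, fast_cell, slow_cell, w_out, 0.0)
```

Under these assumptions, a further modality such as gaze would be added as another cell running at its own rate, with its hidden state appended to the fused vector.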
