Triadic Multi-party Voice Activity Projection for Turn-taking in Spoken Dialogue Systems
Mikey Elmers, Koji Inoue, Divesh Lala, Tatsuya Kawahara

TL;DR
This paper extends voice activity projection (VAP) to triadic (three-party) conversation for turn-taking prediction. Models trained on a Japanese triadic dataset outperformed the baseline, although accuracy varied with the type of conversation.
Contribution
First to adapt VAP for triadic conversations, enabling turn-taking prediction in multi-party spoken dialogue systems.
Findings
VAP trained on triadic data outperforms baselines
Conversation type influences prediction accuracy
VAP can be used for turn-taking prediction in triadic dialogue; integration into spoken dialogue systems is left to future work
Abstract
Turn-taking is a fundamental component of spoken dialogue; however, conventional studies mostly involve dyadic settings. This work focuses on applying voice activity projection (VAP) to predict upcoming turn-taking in triadic multi-party scenarios. The goal of VAP models is to predict the future voice activity of each speaker using only acoustic data. This is the first study to extend VAP to triadic conversation. We trained multiple models on a Japanese triadic dataset in which participants discussed a variety of topics. We found that VAP trained on triadic conversation outperformed the baseline for all models, but that the type of conversation affected accuracy. This study establishes that VAP can be used for turn-taking in triadic dialogue scenarios. Future work will incorporate this triadic VAP turn-taking model into spoken dialogue systems.
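To make the prediction target concrete, the following is a minimal sketch of how per-speaker "future voice activity" labels could be derived for three speakers. It is an illustration only, not the authors' implementation: the abstract does not describe the label encoding, so the function name `future_activity_labels`, the frame-level 0/1 input, the `horizon` length, and the `threshold` are all assumptions made for this example. A real VAP model would predict such targets from audio rather than compute them from known activity.

```python
from typing import List


def future_activity_labels(
    activity: List[List[int]],  # activity[s][t] == 1 if speaker s speaks in frame t
    horizon: int,               # hypothetical look-ahead length, in frames
    threshold: float = 0.5,     # hypothetical fraction of active frames required
) -> List[List[int]]:
    """For each frame t, return one 0/1 label per speaker indicating whether
    that speaker is active for at least `threshold` of frames t+1 .. t+horizon."""
    n_frames = len(activity[0])
    labels = []
    # Stop early so that every look-ahead window is complete.
    for t in range(n_frames - horizon):
        frame_labels = []
        for speaker in activity:
            window = speaker[t + 1 : t + 1 + horizon]
            frame_labels.append(1 if sum(window) / horizon >= threshold else 0)
        labels.append(frame_labels)
    return labels


# Toy triadic example: speaker A yields the floor to speaker B; C stays silent.
activity = [
    [1, 1, 1, 0, 0, 0, 0, 0],  # speaker A
    [0, 0, 0, 0, 1, 1, 1, 1],  # speaker B
    [0, 0, 0, 0, 0, 0, 0, 0],  # speaker C
]
labels = future_activity_labels(activity, horizon=2)
for t, frame_labels in enumerate(labels):
    print(t, frame_labels)
```

In this toy input, the labels switch from speaker A to speaker B before B actually starts speaking, which is the kind of upcoming turn shift a projection model is meant to anticipate.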
