Triadic Multi-party Voice Activity Projection for Turn-taking in Spoken Dialogue Systems

Mikey Elmers; Koji Inoue; Divesh Lala; Tatsuya Kawahara

arXiv:2507.07518 · cs.CL · October 6, 2025

TL;DR

This paper extends voice activity projection (VAP) to triadic multi-party conversations. Models trained on a Japanese triadic dataset outperformed the baseline on turn-taking prediction, although accuracy varied with the type of conversation.

Contribution

First to adapt VAP for triadic conversations, enabling turn-taking prediction in multi-party spoken dialogue systems.

Findings

01 VAP trained on triadic conversation outperformed the baseline for all models

02 The type of conversation affected prediction accuracy

03 Integrating triadic VAP into spoken dialogue systems is left to future work

Abstract

Turn-taking is a fundamental component of spoken dialogue; however, conventional studies mostly involve dyadic settings. This work focuses on applying voice activity projection (VAP) to predict upcoming turn-taking in triadic multi-party scenarios. The goal of VAP models is to predict the future voice activity of each speaker using only acoustic data. This is the first study to extend VAP to triadic conversation. We trained multiple models on a Japanese triadic dataset in which participants discussed a variety of topics. We found that VAP trained on triadic conversation outperformed the baseline for all models, but that the type of conversation affected the accuracy. This study establishes that VAP can be used for turn-taking in triadic dialogue scenarios. Future work will incorporate this triadic VAP turn-taking model into spoken dialogue systems.
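
To make "predicting the future voice activity for each speaker" concrete, here is a minimal sketch of how a training target for three speakers could be built from per-frame voice activity. Everything specific in it is an assumption for illustration and is not taken from the abstract: the 50 Hz frame rate, the four future bins of 0.2, 0.4, 0.6 and 0.8 s, the "at least half voiced" rule, and the helper names `future_activity_label` and `state_index`. Under these assumptions a triadic target has 3 x 4 = 12 binary values, which gives 2^12 = 4096 possible states.

```python
from typing import List, Sequence

# Assumed bin lengths in frames (hypothetical: 50 Hz frames,
# future bins of 0.2, 0.4, 0.6 and 0.8 seconds).
BIN_FRAMES = (10, 20, 30, 40)


def future_activity_label(
    vad: Sequence[Sequence[int]],
    t: int,
    bin_frames: Sequence[int] = BIN_FRAMES,
) -> List[List[int]]:
    """Return one binary label per speaker and future bin, starting at frame t.

    vad[s][f] is 1 if speaker s is voiced at frame f, else 0.
    A bin is labelled active when at least half of its frames are voiced
    (an assumed rule, chosen for illustration).
    """
    horizon = sum(bin_frames)
    if t < 0 or any(t + horizon > len(track) for track in vad):
        raise ValueError("not enough future frames after t")
    labels = []
    for track in vad:
        start = t
        row = []
        for n in bin_frames:
            voiced = sum(track[start:start + n])
            row.append(1 if 2 * voiced >= n else 0)
            start += n
        labels.append(row)
    return labels


def state_index(labels: Sequence[Sequence[int]]) -> int:
    """Pack the per-speaker bin labels into a single integer class index."""
    index = 0
    for row in labels:
        for bit in row:
            index = (index << 1) | bit
    return index


if __name__ == "__main__":
    # Toy example: speaker 0 talks for the first 0.6 s, speaker 1 then
    # takes the turn, and speaker 2 stays silent throughout.
    spk0 = [1] * 30 + [0] * 70
    spk1 = [0] * 30 + [1] * 70
    spk2 = [0] * 100
    labels = future_activity_label([spk0, spk1, spk2], t=0)
    print(labels)               # [[1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 0, 0]]
    print(state_index(labels))  # 3120
```

In this toy example the label pattern itself encodes a turn shift from speaker 0 to speaker 1. A model trained to predict such targets from audio would, under the same assumptions, be forecasting who holds the floor over the next two seconds.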
