Multilingual Turn-taking Prediction Using Voice Activity Projection
Koji Inoue, Bing'er Jiang, Erik Ekstedt, Tatsuya Kawahara, Gabriel Skantze

TL;DR
This study applies voice activity projection (VAP) to turn-taking prediction in English, Mandarin, and Japanese dialogue, showing that a single model trained on all three languages performs on par with monolingual models and learns to discern the language of the input speech.
Contribution
It introduces a multilingual voice activity projection model for turn-taking prediction that works effectively across English, Mandarin, and Japanese, and compares different audio encoders.
Findings
Multilingual VAP performs on par with monolingual models across languages.
The multilingual model can identify the language of the input signal.
Multilingual wav2vec 2.0 encoder shows promising results for speech prediction.
Abstract
This paper investigates the application of voice activity projection (VAP), a predictive turn-taking model for spoken dialogue, on multilingual data, encompassing English, Mandarin, and Japanese. The VAP model continuously predicts the upcoming voice activities of participants in dyadic dialogue, leveraging a cross-attention Transformer to capture the dynamic interplay between participants. The results show that a monolingual VAP model trained on one language does not make good predictions when applied to other languages. However, a multilingual model, trained on all three languages, demonstrates predictive performance on par with monolingual models across all languages. Further analyses show that the multilingual model has learned to discern the language of the input signal. We also analyze the sensitivity to pitch, a prosodic cue that is thought to be important for turn-taking.…
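To make the prediction target concrete, here is a minimal sketch of how future voice activity for two speakers could be turned into a per-frame class label. The abstract does not specify the encoding, so the function name `projection_labels`, the bin sizes, and the activity threshold are all assumptions chosen for illustration, not the authors' settings.

```python
from typing import List, Sequence, Tuple


def projection_labels(
    vad: Sequence[Tuple[int, int]],
    bin_sizes: Sequence[int] = (2, 4, 6, 8),  # assumed bin lengths, in frames
    threshold: float = 0.5,                   # assumed activity ratio per bin
) -> List[int]:
    """Encode the upcoming voice activity of both speakers as one integer per frame.

    vad[t] is (speaker_a_active, speaker_b_active), each 0 or 1.
    The window after frame t is split into consecutive bins of `bin_sizes`
    frames. A bin is marked active for a speaker when at least `threshold`
    of its frames are active. The resulting bits (speaker A's bins first,
    then speaker B's) are packed into a single class index.
    """
    horizon = sum(bin_sizes)
    labels = []
    # Only frames with a complete future window receive a label.
    for t in range(len(vad) - horizon):
        code = 0
        for speaker in (0, 1):
            start = t + 1
            for size in bin_sizes:
                window = vad[start:start + size]
                ratio = sum(frame[speaker] for frame in window) / size
                code = (code << 1) | int(ratio >= threshold)
                start += size
        labels.append(code)
    return labels


# Speaker A talks for two frames, then speaker B takes the turn.
frames = [(1, 0), (1, 0), (0, 1), (0, 1)]
print(projection_labels(frames, bin_sizes=(1, 1)))
```

A model trained against labels of this kind has to anticipate who will be speaking in the near future at every frame, rather than reacting only once a silence has been detected.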
Taxonomy
Topics: Speech and dialogue systems
Methods: Attention Is All You Need · Linear Layer · Dropout · Multi-Head Attention · Position-Wise Feed-Forward Layer · Layer Normalization · Absolute Position Encodings · Softmax · Dense Connections · Label Smoothing
