Voice Activity Projection Model with Multimodal Encoders

Takeshi Saga; Catherine Pelachaud

arXiv:2506.03980 · cs.CL · June 5, 2025

TL;DR

This paper introduces a multimodal voice activity projection model using pre-trained audio and face encoders to better capture social cues, improving turn-taking prediction in human-machine interactions.

Contribution

The paper presents a multimodal VAP model built on pre-trained audio and face encoders that capture subtle expressions. It performs competitively with state-of-the-art models on turn-taking metrics and outperforms them in some cases.

Findings

01. Model performs competitively on turn-taking metrics.

02. Outperforms previous state-of-the-art models in some cases.

03. Source code and models are publicly available.

Abstract

Turn-taking management is crucial for any social interaction. Still, it is challenging to model human-machine interaction due to the complexity of the social context and its multimodal nature. Unlike conventional systems based on silence duration, existing voice activity projection (VAP) models successfully utilized a unified representation of turn-taking behaviors as prediction targets, which improved turn-taking prediction performance. Recently, a multimodal VAP model outperformed the previous state-of-the-art model by a significant margin. In this paper, we propose a multimodal model enhanced with pre-trained audio and face encoders to improve performance by capturing subtle expressions. Our model performed competitively with, and in some cases even better than, state-of-the-art models on turn-taking metrics. All the source code and pretrained models are available at…
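
To make the "unified representation of turn-taking behaviors as prediction targets" more concrete, here is a minimal sketch of how a VAP-style target is commonly described: the near-future voice activity of two speakers is split into a few time bins, each bin is reduced to a single active/inactive bit, and the bits are packed into one discrete class label. The abstract gives none of these details, so everything specific here is an assumption for illustration only: the function name `encode_vap_state`, the four bin lengths, the 50 Hz frame rate they imply, and the 0.5 activity threshold. This is not the authors' implementation.

```python
from typing import Sequence, Tuple

# Assumed bin lengths in frames (hypothetical: 0.2, 0.4, 0.6, 0.8 s at 50 frames/s).
BIN_FRAMES: Tuple[int, ...] = (10, 20, 30, 40)


def encode_vap_state(
    future_va: Sequence[Sequence[int]],
    bin_frames: Sequence[int] = BIN_FRAMES,
    threshold: float = 0.5,
) -> int:
    """Pack the future voice activity of two speakers into one class index.

    future_va holds one 0/1 frame sequence per speaker, each covering the
    whole projection window (sum(bin_frames) frames). A bin counts as active
    when the share of voiced frames reaches `threshold` (an assumed rule).
    """
    if len(future_va) != 2:
        raise ValueError("expected voice activity for exactly two speakers")

    window = sum(bin_frames)
    state = 0
    for speaker in future_va:
        if len(speaker) != window:
            raise ValueError(f"each speaker needs {window} frames")
        start = 0
        for n in bin_frames:
            voiced_ratio = sum(speaker[start:start + n]) / n
            bit = 1 if voiced_ratio >= threshold else 0
            state = (state << 1) | bit  # first speaker ends up in the high bits
            start += n
    return state


if __name__ == "__main__":
    speaking = [1] * 100
    silent = [0] * 100
    print(encode_vap_state([speaking, silent]))  # first speaker holds the floor
    print(encode_vap_state([silent, speaking]))  # second speaker holds the floor
```

With two speakers and four bins, this encoding yields 2^8 = 256 possible classes, so a model can treat turn-taking prediction as classification over joint future activity patterns instead of waiting for a silence threshold to elapse.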

Taxonomy

Topics: Emotion and Mood Recognition · Speech and Audio Processing · Phonetics and Phonology Research