Real-time and Continuous Turn-taking Prediction Using Voice Activity Projection

Koji Inoue; Bing'er Jiang; Erik Ekstedt; Tatsuya Kawahara; Gabriel Skantze

arXiv:2401.04868 · cs.CL · January 11, 2024 · 2 cites

TL;DR

This paper introduces a real-time turn-taking prediction system using a voice activity projection model that leverages contrastive predictive coding and self-attention transformers to predict future dialogue voice activity from stereo audio.

Contribution

The paper presents a novel VAP model combining CPC and transformers for real-time turn-taking prediction directly from stereo audio.

Findings

01 The system runs in real time on a CPU

02 Performance degrades only minimally as the input context length varies

03 Future voice activity is predicted effectively from stereo audio

Abstract

A demonstration of a real-time and continuous turn-taking prediction system is presented. The system is based on a voice activity projection (VAP) model, which directly maps dialogue stereo audio to future voice activities. The VAP model includes contrastive predictive coding (CPC) and self-attention transformers, followed by a cross-attention transformer. We examine the effect of the input context audio length and demonstrate that the proposed system can operate in real-time with CPU settings, with minimal performance degradation.
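
To make the role of the input context length concrete, here is a minimal sketch of a sliding context window over streaming stereo audio, the kind of buffer a real-time predictor could read from at each step. It is an illustration under stated assumptions, not the authors' implementation: the names `ContextBuffer` and `stream_windows`, the sample rate, and the context length are all hypothetical, and the VAP model call itself is left out.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, Tuple

# One stereo sample: (channel of speaker 1, channel of speaker 2).
StereoSample = Tuple[float, float]


class ContextBuffer:
    """Hypothetical helper: keep only the last `context_sec` seconds of audio."""

    def __init__(self, sample_rate: int, context_sec: float) -> None:
        if sample_rate <= 0 or context_sec <= 0:
            raise ValueError("sample_rate and context_sec must be positive")
        self.max_samples = max(1, int(round(sample_rate * context_sec)))
        # A bounded deque discards the oldest samples once it is full.
        self._samples: Deque[StereoSample] = deque(maxlen=self.max_samples)

    def push(self, chunk: Iterable[StereoSample]) -> None:
        """Append a newly captured chunk of stereo samples."""
        self._samples.extend(chunk)

    def window(self) -> List[StereoSample]:
        """Return the current context window, oldest sample first."""
        return list(self._samples)


def stream_windows(
    chunks: Iterable[List[StereoSample]], sample_rate: int, context_sec: float
) -> Iterator[List[StereoSample]]:
    """Yield the context window after each incoming chunk.

    In a real system each yielded window would be passed to the
    turn-taking model; that call is omitted here on purpose.
    """
    buf = ContextBuffer(sample_rate, context_sec)
    for chunk in chunks:
        buf.push(chunk)
        yield buf.window()
```

A shorter `context_sec` means fewer samples per inference step and therefore less computation. The abstract reports that the system runs in real time with CPU settings with minimal performance degradation when the input context length is varied.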

Code & Models

Models

🤗 marcosremar2/turn-taking-study (model)

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Speech and dialogue systems

Methods: InfoNCE · Contrastive Predictive Coding