VoiceVector: Multimodal Enrolment Vectors for Speaker Separation

Akam Rahimi; Triantafyllos Afouras; Andrew Zisserman

arXiv:2501.01401 · eess.AS · January 3, 2025 · ICASSP

Open Access

TL;DR

VoiceVector is a transformer-based multimodal system for separating a target speaker's voice from other speakers and ambient noise. Its enrolment vectors can be derived from audio, from video of lip movements, or from both, and separation can be conditioned on multiple positive and negative enrolment vectors.

Contribution

The paper presents an enrolment network that produces speaker-specific vectors from audio only, audio-visual data, or silent video of lip movements, together with a separation network that can be conditioned on multiple positive and negative enrolment vectors.

Findings

1. Outperforms the previous methods it is compared against
2. Enrolment vectors can be generated from audio, video, or both
3. Separation can be conditioned on multiple positive and negative enrolment vectors

Abstract

We present a transformer-based architecture for voice separation of a target speaker from multiple other speakers and ambient noise. We achieve this by using two separate neural networks: (A) An enrolment network designed to craft speaker-specific embeddings, exploiting various combinations of audio and visual modalities; and (B) A separation network that accepts both the noisy signal and enrolment vectors as inputs, outputting the clean signal of the target speaker. The novelties are: (i) the enrolment vector can be produced from: audio only, audio-visual data (using lip movements) or visual data alone (using lip movements from silent video); and (ii) the flexibility in conditioning the separation on multiple positive and negative enrolment vectors. We compare with previous methods and obtain superior performance.
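
The two-network data flow described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the function names (`enrol`, `separation_mask`, `separate`), the mean-pooling fusion, and the cosine-similarity soft mask are hypothetical stand-ins, not the paper's transformer networks. The sketch only shows the interface: an enrolment step that accepts audio features, lip features, or both, and a separation step conditioned on positive and negative enrolment vectors.

```python
import math
from typing import List, Optional, Sequence

Vector = List[float]


def mean_pool(vectors: Sequence[Sequence[float]]) -> Vector:
    """Average equally sized vectors element-wise."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def enrol(audio: Optional[Sequence[Sequence[float]]] = None,
          lips: Optional[Sequence[Sequence[float]]] = None) -> Vector:
    """Toy enrolment step (hypothetical): pool each available modality
    over time, then average the modality embeddings. Assumes both
    modalities were already projected to the same dimension."""
    streams = [mean_pool(s) for s in (audio, lips) if s is not None]
    if not streams:
        raise ValueError("need audio features, lip features, or both")
    return mean_pool(streams)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def separation_mask(mixture: Sequence[Sequence[float]],
                    positives: Sequence[Sequence[float]],
                    negatives: Sequence[Sequence[float]] = ()) -> List[float]:
    """Toy separation step (hypothetical): one soft mask value per frame.
    Frames similar to the pooled positive enrolment vectors score high;
    similarity to any negative enrolment vector lowers the score."""
    target = mean_pool(positives)
    masks = []
    for frame in mixture:
        score = cosine(frame, target)
        if negatives:
            score -= max(cosine(frame, n) for n in negatives)
        # Sigmoid with an arbitrary sharpness of 4.0 (illustrative choice).
        masks.append(1.0 / (1.0 + math.exp(-4.0 * score)))
    return masks


def separate(mixture: Sequence[Sequence[float]],
             positives: Sequence[Sequence[float]],
             negatives: Sequence[Sequence[float]] = ()) -> List[Vector]:
    """Apply the soft mask to each mixture frame."""
    masks = separation_mask(mixture, positives, negatives)
    return [[m * x for x in frame] for m, frame in zip(masks, mixture)]


# Usage: enrol the target from audio plus lips, an interferer from audio only.
target_vec = enrol(audio=[[1.0, 0.0], [0.9, 0.1]], lips=[[1.0, 0.0]])
other_vec = enrol(audio=[[0.0, 1.0]])
mixture_frames = [[1.0, 0.0], [0.0, 1.0]]
clean = separate(mixture_frames, [target_vec], [other_vec])
```

In this toy setup, supplying a negative enrolment vector pushes the mask for interferer-like frames further towards zero than conditioning on the positive vector alone, which is the intuition behind conditioning on both kinds of vector.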

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing