VoiceVector: Multimodal Enrolment Vectors for Speaker Separation
Akam Rahimi, Triantafyllos Afouras, Andrew Zisserman

TL;DR
VoiceVector is a transformer-based multimodal system for separating a target speaker's voice from other speakers and ambient noise. Its enrolment vectors can be built from audio, audio-visual, or visual-only data, and separation can be conditioned on multiple positive and negative enrolment vectors.
Contribution
The paper presents multimodal enrolment vectors that can be derived from audio only, audio-visual data, or silent video of lip movements, together with a separation network that can be conditioned on multiple positive and negative enrolment vectors.
Findings
Superior performance over previous methods
Flexible enrolment vector generation from multiple modalities
Effective separation conditioned on multiple speakers
Abstract
We present a transformer-based architecture for voice separation of a target speaker from multiple other speakers and ambient noise. We achieve this by using two separate neural networks: (A) An enrolment network designed to craft speaker-specific embeddings, exploiting various combinations of audio and visual modalities; and (B) A separation network that accepts both the noisy signal and enrolment vectors as inputs, outputting the clean signal of the target speaker. The novelties are: (i) the enrolment vector can be produced from: audio only, audio-visual data (using lip movements) or visual data alone (using lip movements from silent video); and (ii) the flexibility in conditioning the separation on multiple positive and negative enrolment vectors. We compare with previous methods and obtain superior performance.
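The two-network interface described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `Enrolment` class, the `separation_inputs` function, and the trailing +1/-1/0 polarity flag are all hypothetical names and choices made only to show how several positive and negative enrolment vectors from different modalities might be packed alongside the noisy signal before being passed to a separation model.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Modalities named in the abstract; the string labels are an assumption.
MODALITIES = {"audio", "audio-visual", "visual"}


@dataclass
class Enrolment:
    """Hypothetical container for one output of an enrolment network."""
    vector: List[float]  # speaker-specific embedding
    modality: str        # which data the embedding was derived from
    positive: bool       # True: speaker to keep, False: speaker to suppress


def build_condition_tokens(enrolments: Sequence[Enrolment], dim: int) -> List[List[float]]:
    """Turn each enrolment into a token: the embedding plus a polarity flag.

    The +1.0 / -1.0 flag is an illustrative assumption, not the paper's scheme.
    """
    tokens = []
    for e in enrolments:
        if len(e.vector) != dim:
            raise ValueError("enrolment vector has the wrong dimension")
        if e.modality not in MODALITIES:
            raise ValueError(f"unknown modality: {e.modality}")
        tokens.append(list(e.vector) + [1.0 if e.positive else -1.0])
    return tokens


def separation_inputs(noisy_frames: Sequence[Sequence[float]],
                      enrolments: Sequence[Enrolment],
                      dim: int) -> List[List[float]]:
    """Pack enrolment tokens followed by noisy-signal frames into one sequence.

    Noisy frames get a neutral 0.0 flag so they are distinguishable from
    enrolment tokens. A real separation network would consume such a sequence.
    """
    frames = [list(f) + [0.0] for f in noisy_frames]
    return build_condition_tokens(enrolments, dim) + frames


# Example: one positive audio enrolment, one negative visual enrolment.
enrolments = [
    Enrolment([0.1, 0.2, 0.3], "audio", True),
    Enrolment([0.4, 0.5, 0.6], "visual", False),
]
noisy = [[0.0, 0.0, 0.0] for _ in range(4)]
sequence = separation_inputs(noisy, enrolments, dim=3)
```

Because the number of enrolment tokens is not fixed, the same packing works for a single target speaker or for several speakers to keep and suppress, which is the flexibility the abstract highlights.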
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
