FaVoA: Face-Voice Association Favours Ambiguous Speaker Detection

Hugo Carneiro; Cornelius Weber; Stefan Wermter

arXiv:2109.00577 · cs.LG · September 7, 2021

TL;DR

FaVoA leverages face-voice associations to improve active speaker detection, especially in ambiguous or challenging scenarios, by estimating facial representations from speech and integrating multimodal cues.

Contribution

Introduces FaVoA, a neural network model that incorporates a face-voice association network into an existing active speaker detection model so that particularly ambiguous cases can be classified correctly.

Findings

01 Improves classification accuracy in ambiguous scenarios

02 Effectively rules out non-matching face-voice pairs

03 Quantifies modality contributions using gated-bimodal-unit architecture
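
To illustrate how a gated bimodal unit can expose per-modality contributions, here is a minimal NumPy sketch of the generic gated fusion formulation: each modality is projected through a tanh layer, and a sigmoid gate computed from both inputs mixes the two projections. This is not FaVoA's actual implementation. The function name, the weight shapes, the random inputs, and the use of the mean gate value as a contribution score are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_bimodal_unit(x_a, x_v, W_a, W_v, W_z):
    """Fuse an audio vector x_a and a visual vector x_v (hypothetical helper).

    Returns the fused vector h and the gate z. Each gate entry lies in (0, 1):
    values near 1 mean that dimension relies on audio, values near 0 on vision.
    """
    h_a = np.tanh(W_a @ x_a)                       # audio projection
    h_v = np.tanh(W_v @ x_v)                       # visual projection
    z = sigmoid(W_z @ np.concatenate([x_a, x_v]))  # gate sees both modalities
    h = z * h_a + (1.0 - z) * h_v                  # convex mix per dimension
    return h, z


# Toy dimensions and random weights, purely illustrative.
rng = np.random.default_rng(0)
d_a, d_v, d_h = 4, 6, 3
x_a = rng.normal(size=d_a)
x_v = rng.normal(size=d_v)
W_a = rng.normal(size=(d_h, d_a))
W_v = rng.normal(size=(d_h, d_v))
W_z = rng.normal(size=(d_h, d_a + d_v))

h, z = gated_bimodal_unit(x_a, x_v, W_a, W_v, W_z)
audio_share = float(z.mean())  # rough summary of how much the audio branch is used
print("gate:", z, "mean audio share:", audio_share)
```

Because the gate is an explicit intermediate value, it can be read off at inference time to see which modality drove a given decision, which is the property the third finding refers to.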

Abstract

The strong relation between face and voice can aid active speaker detection systems when faces are visible, even in difficult settings such as when the speaker's face is not clear or when several people share the same scene. By estimating the frontal facial representation of a person from their speech, it becomes easier to determine whether that person is a potential candidate for being classified as an active speaker, even in challenging cases in which no mouth movement is detected from any person in the scene. By incorporating a face-voice association neural network into an existing state-of-the-art active speaker detection model, we introduce FaVoA (Face-Voice Association Ambiguous Speaker Detector), a neural network model that can correctly classify particularly ambiguous scenarios. FaVoA not only finds positive associations, but helps to rule out non-matching…
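
The idea of checking a voice against the faces visible in a scene can be sketched as a similarity comparison between a face embedding estimated from speech and the embeddings of the observed faces. The sketch below is only an illustration of that idea. The source does not state how FaVoA compares representations, so the cosine similarity metric, the `threshold` value, the function names, and the toy three-dimensional embeddings are all assumptions.

```python
import numpy as np


def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def rank_candidates(voice_face_estimate, face_embeddings, threshold=0.5):
    """Score each visible face against the face estimated from the voice.

    Hypothetical helper: returns all scores plus the indices of faces whose
    score reaches the (assumed) threshold, i.e. the plausible speakers.
    """
    scores = [cosine_similarity(voice_face_estimate, f) for f in face_embeddings]
    plausible = [i for i, s in enumerate(scores) if s >= threshold]
    return scores, plausible


# Toy embeddings: face 0 is close to the voice-derived estimate, face 1 is orthogonal.
estimate = np.array([1.0, 0.0, 0.0])
faces = [np.array([0.9, 0.1, 0.0]), np.array([0.0, 1.0, 0.0])]

scores, plausible = rank_candidates(estimate, faces)
print(scores, plausible)
```

In this toy case only face 0 remains a candidate, while face 1 is ruled out as a non-matching pair even though no mouth-movement cue was used.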
