Active Speaker Detection as a Multi-Objective Optimization with Uncertainty-based Multimodal Fusion
Baptiste Pouthier, Laurent Pilati, Leela K. Gudupudi, Charles Bouveyron, Frederic Precioso

TL;DR
This paper introduces a multi-objective learning framework with uncertainty-based multimodal fusion for active speaker detection. It combines audio and video data and improves on the state of the art on the AVA-ActiveSpeaker dataset.
Contribution
It proposes a novel self-attention, uncertainty-based multimodal fusion scheme within a multi-objective learning architecture for active speaker detection.
Findings
Outperforms traditional fusion approaches in mAP and AUC scores.
Surpasses, on active speaker detection, modality fusion methods reported in other disciplines.
Significantly improves state-of-the-art results on the AVA-ActiveSpeaker dataset.
Abstract
It is now well established from a variety of studies that combining video and audio data brings a significant benefit in detecting active speakers. However, either modality can mislead audiovisual fusion by introducing unreliable or deceptive information. This paper frames active speaker detection as a multi-objective learning problem to leverage the best of each modality using a novel self-attention, uncertainty-based multimodal fusion scheme. Results show that the proposed multi-objective learning architecture outperforms traditional approaches, improving both mAP and AUC scores. We further demonstrate that, in active speaker detection, our fusion strategy surpasses other modality fusion methods reported in various disciplines. Finally, we show that the proposed method significantly improves the state of the art on the AVA-ActiveSpeaker dataset.
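The abstract does not give the details of the fusion scheme, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each modality branch emits a logit and a positive uncertainty estimate (a variance), fuses the two logits with plain inverse-variance weighting (a stand-in for the paper's self-attention mechanism, which is omitted here), and sums one binary cross-entropy term per objective (audio, video, fused). All function names and the equal loss weights are hypothetical.

```python
import math


def sigmoid(x):
    """Map a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def bce(prob, label, eps=1e-7):
    """Binary cross-entropy for one prediction; prob is clipped for stability."""
    p = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def fuse_by_uncertainty(logits, variances):
    """Weight each modality's logit by its normalized inverse variance.

    Assumption: a lower variance means a more reliable modality.
    """
    inverse = [1.0 / v for v in variances]
    total = sum(inverse)
    weights = [w / total for w in inverse]
    fused = sum(w * z for w, z in zip(weights, logits))
    return fused, weights


def multi_objective_loss(audio_logit, video_logit, audio_var, video_var, label):
    """Sum of per-modality and fused losses (equal weights are an assumption)."""
    fused_logit, weights = fuse_by_uncertainty(
        [audio_logit, video_logit], [audio_var, video_var]
    )
    losses = {
        "audio": bce(sigmoid(audio_logit), label),
        "video": bce(sigmoid(video_logit), label),
        "fused": bce(sigmoid(fused_logit), label),
    }
    losses["total"] = losses["audio"] + losses["video"] + losses["fused"]
    return losses, weights


# Example: confident audio (low variance) and noisy video (high variance).
losses, weights = multi_objective_loss(2.0, -1.0, 0.5, 2.0, label=1)
```

In this example the audio branch receives the larger fusion weight because its variance is lower, so an unreliable video stream has less influence on the fused prediction. Keeping the per-modality loss terms alongside the fused one is what makes the objective multi-objective in this sketch.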
