Audio-video fusion strategies for active speaker detection in meetings
Lionel Pibre, Francisco Madrigal, Cyrille Equoy, Frédéric Lerasle, Thomas Pellegrini, Julien Pinquier, Isabelle Ferrané

TL;DR
This paper explores audio-visual fusion strategies using neural networks to improve active speaker detection in meetings, demonstrating that motion information and attention-based fusion significantly enhance accuracy.
Contribution
The study introduces novel fusion methods combining visual, motion, and audio data with neural networks for active speaker detection, outperforming classical approaches.
Findings
Motion information greatly improves detection performance.
Attention-based fusion reduces variability and enhances accuracy.
Visual and audio fusion outperforms single-modality systems.
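To make the idea of attention-based fusion concrete, here is a minimal NumPy sketch. It is not the authors' architecture: the embedding size, the single scoring vector, the random inputs, and the concatenation baseline shown for contrast are all assumptions chosen for illustration. Each modality (face, motion, audio) is assumed to have already been encoded into a fixed-size vector.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def concat_fusion(embeddings):
    # Baseline for contrast (an assumption): stack the modality vectors end to end.
    return np.concatenate(embeddings, axis=-1)

def attention_fusion(embeddings, score_vector):
    # Hypothetical attention fusion: score each modality embedding with a
    # (normally learned) vector, normalise the scores with a softmax, and
    # return the weighted sum of the embeddings.
    stacked = np.stack(embeddings, axis=0)   # shape (n_modalities, dim)
    scores = stacked @ score_vector          # shape (n_modalities,)
    weights = softmax(scores)                # non-negative, sums to 1
    fused = weights @ stacked                # shape (dim,)
    return fused, weights

# Random stand-ins for per-modality embeddings; a real system would obtain
# these from trained face, motion, and audio encoders.
rng = np.random.default_rng(0)
dim = 8
face = rng.normal(size=dim)
motion = rng.normal(size=dim)
audio = rng.normal(size=dim)
score_vector = rng.normal(size=dim)  # would be a trained parameter

fused, weights = attention_fusion([face, motion, audio], score_vector)
concatenated = concat_fusion([face, motion, audio])
```

In this sketch the fused vector keeps the size of a single embedding, while concatenation grows with the number of modalities. The attention weights also show how much each modality contributes to a given decision.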
Abstract
Meetings are a common activity in professional contexts, and it remains challenging to endow vocal assistants with advanced functionalities to facilitate meeting management. In this context, a task like active speaker detection can provide useful insights to model interaction between meeting participants. Motivated by our application context related to an advanced meeting assistant, we want to combine audio and visual information to achieve the best possible performance. In this paper, we propose two different types of fusion for the detection of the active speaker, combining two visual modalities and an audio modality through neural networks. For comparison purposes, classical unsupervised approaches for audio feature extraction are also used. We expect visual data centered on the face of each participant to be very appropriate for detecting voice activity, based on the detection of lip…
Taxonomy
Topics: Speech and Audio Processing
