Audio-video fusion strategies for active speaker detection in meetings

Lionel Pibre; Francisco Madrigal; Cyrille Equoy; Frédéric Lerasle; Thomas Pellegrini; Julien Pinquier; Isabelle Ferrané

arXiv:2206.10411 · cs.CV · June 22, 2022 · 1 citation

TL;DR

This paper explores audio-visual fusion strategies using neural networks to improve active speaker detection in meetings, demonstrating that motion information and attention-based fusion significantly enhance accuracy.

Contribution

The study introduces novel fusion methods combining visual, motion, and audio data with neural networks for active speaker detection, outperforming classical approaches.

Findings

01. Motion information greatly improves detection performance.

02. Attention-based fusion reduces variability and enhances accuracy.

03. Visual and audio fusion outperforms single-modality systems.
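
To make the contrast between plain and attention-based fusion concrete, here is a minimal sketch in pure Python. It is a generic illustration, not the authors' architecture, which this page does not describe. The embedding values, the `query` vector, and the function names `concat_fusion` and `attention_fusion` are all hypothetical; in a trained network the embeddings would come from per-modality encoders and the query would be learned.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def concat_fusion(embeddings):
    """Naive fusion: concatenate the per-modality embeddings."""
    return [x for emb in embeddings.values() for x in emb]


def attention_fusion(embeddings, query):
    """Weight each modality by softmax(query . embedding), then sum.

    Returns the fused vector and the per-modality attention weights.
    """
    names = list(embeddings)
    scores = [sum(q * x for q, x in zip(query, embeddings[n])) for n in names]
    weights = softmax(scores)
    fused = [
        sum(w * embeddings[n][i] for w, n in zip(weights, names))
        for i in range(len(query))
    ]
    return fused, dict(zip(names, weights))


# Hypothetical 3-dimensional embeddings for one participant at one time step.
embeddings = {
    "face": [0.2, 0.1, 0.0],    # appearance of the face crop
    "motion": [0.9, 0.8, 0.1],  # motion around the face (assumed encoding)
    "audio": [0.5, 0.4, 0.3],   # audio features
}
query = [1.0, 1.0, 0.0]  # hypothetical; would be a learned parameter

fused, weights = attention_fusion(embeddings, query)
concatenated = concat_fusion(embeddings)
```

With these toy values the motion embedding receives the largest weight because it has the highest dot product with the query. Concatenation instead keeps every modality at full weight and leaves the downstream classifier to sort out their relative importance.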

Abstract

Meetings are a common activity in professional contexts, and it remains challenging to endow vocal assistants with advanced functionalities to facilitate meeting management. In this context, a task like active speaker detection can provide useful insights to model interaction between meeting participants. Motivated by our application context, an advanced meeting assistant, we want to combine audio and visual information to achieve the best possible performance. In this paper, we propose two different types of fusion for the detection of the active speaker, combining two visual modalities and an audio modality through neural networks. For comparison purposes, classical unsupervised approaches for audio feature extraction are also used. We expect visual data centered on the face of each participant to be very appropriate for detecting voice activity, based on the detection of lip…

Taxonomy

Topics: Speech and Audio Processing