Audio-Visual Approach For Multimodal Concurrent Speaker Detection
Amit Eliav, Sharon Gannot

TL;DR
This paper introduces a multimodal deep learning model that combines audio and visual data for concurrent speaker detection (CSD). The model is evaluated on two real-world datasets, AMI and EasyCom, and is the first such approach evaluated on the challenging EasyCom dataset.
Contribution
The study presents the first multimodal deep learning approach for CSD evaluated on EasyCom, using cross-modal attention and early fusion strategies.
Findings
Effective multimodal fusion improves CSD accuracy.
The model achieves state-of-the-art results on the AMI and EasyCom datasets.
Ablation study confirms the importance of design choices.
Abstract
Concurrent Speaker Detection (CSD), the task of identifying active speakers and their overlaps in an audio signal, is essential for various audio applications, including meeting transcription, speaker diarization, and speech separation. This study presents a multimodal deep learning approach that integrates audio and visual information. The proposed model utilizes an early fusion strategy, combining audio and visual features through cross-modal attention mechanisms with a learnable [CLS] token to capture key audio-visual relationships. The model is extensively evaluated on two real-world datasets, the established AMI dataset and the recently introduced EasyCom dataset. Experiments validate the effectiveness of the multimodal fusion strategy. An ablation study further supports the design choices and the model's training procedure. As this is the first work reporting CSD results on the…
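The fusion step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify layer sizes, the number of attention heads, which modality supplies the queries, or the output classes. The sketch assumes single-head attention, audio tokens (plus a [CLS] token) querying the visual features, a residual connection, and three hypothetical output classes (no speech, one speaker, overlapping speakers). All names (`cross_attention`, `fuse_and_classify`, `params`) and dimensions are illustrative, and the weights are random rather than trained.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product attention: queries attend to context."""
    q = queries @ w_q                      # (n_queries, d)
    k = context @ w_k                      # (n_context, d)
    v = context @ w_v                      # (n_context, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ v, weights


def fuse_and_classify(audio, video, params):
    """Fuse audio and video features, then classify from the [CLS] token."""
    # Prepend the (here random, normally learnable) [CLS] token to the audio tokens.
    tokens = np.vstack([params["cls"], audio])          # (1 + T_audio, d)
    # Assumption: audio-side tokens query the visual features.
    attended, weights = cross_attention(
        tokens, video, params["w_q"], params["w_k"], params["w_v"]
    )
    fused = tokens + attended                           # residual connection
    cls_out = fused[0]                                  # summary of the fused sequence
    logits = cls_out @ params["w_out"] + params["b_out"]
    return softmax(logits), weights


# Hypothetical sizes: 16-dim features, 50 audio frames, 25 video frames, 3 classes.
rng = np.random.default_rng(0)
d, n_classes = 16, 3
params = {
    "cls": rng.normal(size=(1, d)),
    "w_q": rng.normal(size=(d, d)) * 0.1,
    "w_k": rng.normal(size=(d, d)) * 0.1,
    "w_v": rng.normal(size=(d, d)) * 0.1,
    "w_out": rng.normal(size=(d, n_classes)) * 0.1,
    "b_out": np.zeros(n_classes),
}
audio = rng.normal(size=(50, d))
video = rng.normal(size=(25, d))

probs, weights = fuse_and_classify(audio, video, params)
print(probs.shape, weights.shape)
```

Reading the prediction from the [CLS] position gives a single fixed-size summary of the fused sequence regardless of how many audio or video frames are present, which is the usual motivation for such a token. In a trained model, the [CLS] vector and all projection matrices would be learned jointly.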
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis
Methods: Softmax · Attention Is All You Need
