Audio-Visual Approach For Multimodal Concurrent Speaker Detection

Amit Eliav; Sharon Gannot

arXiv:2407.01774·eess.AS·January 16, 2025

Open Access

TL;DR

This paper introduces a multimodal deep learning model that combines audio and visual data for concurrent speaker detection (CSD). The model is evaluated on two real-world datasets, AMI and the more recent EasyCom, on which it is the first multimodal deep learning approach to report CSD results.

Contribution

The study presents the first multimodal deep learning approach for CSD evaluated on EasyCom, using cross-modal attention and early fusion strategies.

Findings

01. Effective multimodal fusion improves CSD accuracy.
02. Model achieves state-of-the-art results on AMI and EasyCom datasets.
03. Ablation study confirms the importance of design choices.

Abstract

Concurrent Speaker Detection (CSD), the task of identifying active speakers and their overlaps in an audio signal, is essential for various audio applications, including meeting transcription, speaker diarization, and speech separation. This study presents a multimodal deep learning approach that integrates audio and visual information. The proposed model utilizes an early fusion strategy, combining audio and visual features through cross-modal attention mechanisms with a learnable [CLS] token to capture key audio-visual relationships. The model is extensively evaluated on two real-world datasets, the established AMI dataset and the recently introduced EasyCom dataset. Experiments validate the effectiveness of the multimodal fusion strategy. An ablation study further supports the design choices and the model's training procedure. As this is the first work reporting CSD results on the…
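
The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single attention head, identity query/key/value projections, and a fixed random vector standing in for the learnable [CLS] token. The names `attend` and `fuse_with_cls`, the token counts, and the feature dimension are all hypothetical. The sketch only shows how a [CLS] query can attend over the concatenated audio and visual tokens (early fusion) to produce a single joint representation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector.

    Assumption: no learned projections; a trained model would apply
    separate linear maps to obtain queries, keys, and values.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    dim_out = len(values[0])
    output = [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(dim_out)
    ]
    return output, weights


def fuse_with_cls(cls_token, audio_tokens, visual_tokens):
    """Early fusion: concatenate [CLS], audio, and visual tokens into one
    sequence, then let the [CLS] query attend over all of them."""
    tokens = [cls_token] + audio_tokens + visual_tokens
    return attend(cls_token, tokens, tokens)


# Hypothetical toy inputs: 3 audio tokens and 2 visual tokens of dimension 4.
random.seed(0)
dim = 4
cls_token = [random.gauss(0, 1) for _ in range(dim)]  # stand-in for a learned vector
audio_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
visual_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(2)]

fused, weights = fuse_with_cls(cls_token, audio_tokens, visual_tokens)
print(len(fused), len(weights))
```

In a full model, the fused [CLS] vector would presumably feed a classification head; that head and the upstream audio and visual feature extractors are outside the scope of this sketch.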

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Softmax · Attention Is All You Need