Deep Learning Based Audio-Visual Multi-Speaker DOA Estimation Using Permutation-Free Loss Function

Qing Wang; Hang Chen; Ya Jiang; Zhe Wang; Yuyang Wang; Jun Du; Chin-Hui Lee

arXiv:2210.14581·eess.AS·October 27, 2022·1 cites

Open Access

TL;DR

This paper introduces a deep learning approach for multi-speaker DOA estimation using audio-visual data and a permutation-free loss, improving accuracy over audio-only methods in real-world scenarios.

Contribution

It presents a novel spatial annotation method and a permutation-free loss function for multi-speaker DOA estimation with audio-visual signals.

Findings

01. Outperforms audio-only DOA estimation significantly

02. Effective in real-life home TV scenarios

03. Validated on both simulated and real data

Abstract

In this paper, we propose a deep learning based multi-speaker direction of arrival (DOA) estimation method that uses audio and visual signals together with a permutation-free loss function. We first collect a data set for multi-modal sound source localization (SSL) in which both audio and visual signals are recorded in real-life home TV scenarios. We then propose a novel spatial annotation method that produces the ground-truth DOA of each speaker from the video data by transforming between camera coordinates and pixel coordinates according to the pinhole camera model. With spatial location information serving as another input alongside the acoustic features, multi-speaker DOA estimation can be solved as a classification task of active speaker detection. The label permutation problem common to multi-speaker tasks is thereby addressed, since the location of each speaker is used as input. Experiments conducted on…
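The camera-to-pixel transformation mentioned in the abstract can be illustrated with the standard pinhole model. The following is a minimal sketch, not the authors' annotation procedure: the function names, the intrinsic parameters (`fx`, `fy`, `cx`, `cy`), the example values, and the assumption that the camera and microphone array share an origin and a forward axis are all hypothetical and chosen only for illustration.

```python
import math


def project_to_pixel(x, y, z, fx, fy, cx, cy):
    """Project a 3-D point in camera coordinates onto the image plane.

    Standard pinhole model: u = fx * x / z + cx, v = fy * y / z + cy.
    fx, fy are focal lengths in pixels; (cx, cy) is the principal point.
    """
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return fx * x / z + cx, fy * y / z + cy


def pixel_to_azimuth_deg(u, fx, cx):
    """Horizontal angle (degrees) of the viewing ray through pixel column u.

    0 degrees is the optical axis; positive angles are to the right of it.
    Assumption: this angle is used directly as the DOA label, which only
    holds if the camera and microphone array are co-located and aligned.
    """
    return math.degrees(math.atan2(u - cx, fx))


# Hypothetical intrinsics for a 1920x1080 camera.
FX, FY, CX, CY = 1000.0, 1000.0, 960.0, 540.0

# A speaker 1 m to the right and 1 m in front of the camera.
u, v = project_to_pixel(1.0, 0.0, 1.0, FX, FY, CX, CY)
azimuth = pixel_to_azimuth_deg(u, FX, CX)
print(f"pixel=({u:.1f}, {v:.1f}), azimuth={azimuth:.1f} deg")
```

In this sketch the angle depends only on the pixel column and the horizontal focal length, so a face detector's bounding-box centre would be enough to derive an azimuth label, provided the intrinsics are known.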

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Blind Source Separation Techniques