Deep Learning Based Audio-Visual Multi-Speaker DOA Estimation Using Permutation-Free Loss Function
Qing Wang, Hang Chen, Ya Jiang, Zhe Wang, Yuyang Wang, Jun Du and, Chin-Hui Lee

TL;DR
This paper introduces a deep learning approach for multi-speaker DOA estimation using audio-visual data and a permutation-free loss, improving accuracy over audio-only methods in real-world scenarios.
Contribution
It presents a novel spatial annotation method and a permutation-free loss function for multi-speaker DOA estimation with audio-visual signals.
Findings
Outperforms audio-only DOA estimation significantly
Effective in real-life home TV scenarios
Validated on both simulated and real data
Abstract
In this paper, we propose a deep learning based multi-speaker direction of arrival (DOA) estimation with audio and visual signals by using permutation-free loss function. We first collect a data set for multi-modal sound source localization (SSL) where both audio and visual signals are recorded in real-life home TV scenarios. Then we propose a novel spatial annotation method to produce the ground truth of DOA for each speaker with the video data by transformation between camera coordinate and pixel coordinate according to the pin-hole camera model. With spatial location information served as another input along with acoustic feature, multi-speaker DOA estimation could be solved as a classification task of active speaker detection. Label permutation problem in multi-speaker related tasks will be addressed since the locations of each speaker are used as input. Experiments conducted on…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech and Audio Processing · Music and Audio Processing · Blind Source Separation Techniques
