Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion
Israel D. Gebru, Silèye Ba, Xiaofei Li, Radu Horaud

TL;DR
This paper introduces an audio-visual speaker diarization method for multi-party interactions in which participants move around and turn their heads, combining spatiotemporal audio-visual fusion with a probabilistic graphical model.
Contribution
It presents a new audio-visual fusion approach and a probabilistic graphical model for speaker diarization in dynamic multi-participant scenarios, and it introduces a new dataset.
Findings
Outperforms existing diarization algorithms on benchmark datasets.
Handles simultaneous speech from multiple speakers effectively.
Provides an efficient inference procedure for complex models.
Abstract
Speaker diarization consists of assigning speech signals to people engaged in a dialogue. An audio-visual spatiotemporal diarization model is proposed. The model is well suited for challenging scenarios that consist of several participants engaged in multi-party interaction while they move around and turn their heads towards the other participants rather than facing the cameras and the microphones. Multiple-person visual tracking is combined with multiple speech-source localization in order to tackle the speech-to-person association problem. The latter is solved within a novel audio-visual fusion method on the following grounds: binaural spectral features are first extracted from a microphone pair, then a supervised audio-visual alignment technique maps these features onto an image, and finally a semi-supervised clustering method assigns binaural spectral features to visible persons.…
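The speech-to-person association step described in the abstract can be illustrated with a deliberately simplified sketch. It assumes the binaural features have already been mapped to image coordinates, and it replaces the paper's semi-supervised clustering with a plain nearest-person rule and a distance gate. The function names, the `max_dist` and `min_count` parameters, and the pixel coordinates are all hypothetical and are not taken from the paper.

```python
import numpy as np

def assign_features_to_persons(feature_xy, person_xy, max_dist=90.0):
    """Assign each image-mapped audio feature to the nearest visible person.

    Simplified stand-in (an assumption, not the paper's clustering method):
    a feature is labeled with the index of the closest person, or -1 when
    no person lies within max_dist pixels.
    """
    feature_xy = np.asarray(feature_xy, dtype=float).reshape(-1, 2)
    person_xy = np.asarray(person_xy, dtype=float).reshape(-1, 2)
    if len(person_xy) == 0:
        return np.full(len(feature_xy), -1, dtype=int)
    # Pairwise Euclidean distances, shape (num_features, num_persons).
    dists = np.linalg.norm(
        feature_xy[:, None, :] - person_xy[None, :, :], axis=2
    )
    nearest = dists.argmin(axis=1)
    nearest_dist = dists[np.arange(len(feature_xy)), nearest]
    return np.where(nearest_dist <= max_dist, nearest, -1)

def speaking_status(labels, num_persons, min_count=2):
    """Mark a person as speaking when at least min_count features map to them."""
    labels = np.asarray(labels, dtype=int)
    counts = np.bincount(labels[labels >= 0], minlength=num_persons)
    return counts >= min_count

# Hypothetical frame: two tracked persons and four image-mapped audio features.
persons = [(100, 200), (400, 200)]
features = [(105, 195), (98, 210), (390, 205), (700, 50)]
labels = assign_features_to_persons(features, persons)
print(labels.tolist())                                  # [0, 0, 1, -1]
print(speaking_status(labels, len(persons)).tolist())   # [True, False]
```

The distance gate lets features that fall far from every tracked person remain unassigned rather than being forced onto someone. The per-person count threshold permits several people to be marked as speaking in the same frame, which is the situation the paper's model is reported to handle.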
