Variational Bayesian Inference for Audio-Visual Tracking of Multiple Speakers
Yutong Ban, Xavier Alameda-Pineda, Laurent Girin, Radu Horaud

TL;DR
This paper introduces a variational Bayesian approach to audio-visual multi-speaker tracking. It fuses visual and auditory observations to estimate smooth trajectories, to cope with the partial or total absence of one modality over short periods, and to infer whether each tracked person is speaking or silent.
Contribution
It presents a novel variational inference framework for audio-visual tracking that handles missing data and speaker activity estimation, outperforming baseline methods.
Findings
The proposed method accurately tracks multiple speakers in informal meetings.
It effectively manages partial or missing modality data.
The approach outperforms several baseline tracking algorithms.
Abstract
In this paper we address the problem of tracking multiple speakers via the fusion of visual and auditory information. We propose to exploit the complementary nature of these two modalities in order to accurately estimate smooth trajectories of the tracked persons, to deal with the partial or total absence of one of the modalities over short periods of time, and to estimate the acoustic status -- either speaking or silent -- of each tracked person over time. We propose to cast the problem at hand into a generative audio-visual fusion (or association) model formulated as a latent-variable temporal graphical model. This may well be viewed as the problem of maximizing the posterior joint distribution of a set of continuous and discrete latent variables given the past and current observations, which is intractable. We propose a variational inference model which amounts to approximating the…
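To make the idea of a variational approximation over continuous and discrete latent variables concrete, here is a minimal sketch of a generic mean-field update for a single time frame. It is not the authors' algorithm. It assumes a toy 1-D setting in which each person's position has a Gaussian prior, each observation is generated by exactly one person with Gaussian noise, and the assignment prior is uniform. The function name `mean_field_step` and all of its parameters are hypothetical and chosen only for illustration.

```python
import math

def mean_field_step(prior_means, prior_vars, observations, obs_var=1.0, n_iters=10):
    """Toy 1-D mean-field update for one time frame (illustrative only).

    q(s_n) is a Gaussian over the position of person n.
    q(z_m) is a categorical over which person generated observation m.
    """
    n_people = len(prior_means)
    means, variances = list(prior_means), list(prior_vars)
    resp = []
    for _ in range(n_iters):
        # Update q(z): expected Gaussian log-likelihood under q(s_n),
        # using E[(o - s_n)^2] = (o - mean_n)^2 + var_n.
        resp = []
        for o in observations:
            log_w = [
                -0.5 * ((o - means[n]) ** 2 + variances[n]) / obs_var
                for n in range(n_people)
            ]
            top = max(log_w)  # subtract the max for numerical stability
            w = [math.exp(v - top) for v in log_w]
            total = sum(w)
            resp.append([v / total for v in w])
        # Update q(s): Gaussian posterior with responsibility-weighted observations.
        for n in range(n_people):
            precision = 1.0 / prior_vars[n] + sum(r[n] for r in resp) / obs_var
            weighted = prior_means[n] / prior_vars[n] + sum(
                r[n] * o for r, o in zip(resp, observations)
            ) / obs_var
            variances[n] = 1.0 / precision
            means[n] = weighted / precision
    return means, variances, resp

# Two people near 0 and 10, one observation close to each.
means, variances, resp = mean_field_step([0.0, 10.0], [4.0, 4.0], [0.5, 9.0])
```

With no observations in a frame, the update simply returns the prior. In this toy setting, that is how a briefly missing modality leaves the estimate to the prediction alone.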
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Video Surveillance and Tracking Methods
