Position tracking of a varying number of sound sources with sliding permutation invariant training
David Diaz-Guerra, Archontis Politis, Tuomas Virtanen

TL;DR
This paper introduces a training strategy for deep learning sound source localization models that tracks a time-varying number of moving sources, reducing identity switches while maintaining localization accuracy.
Contribution
It proposes a straightforward mean squared error-based training method that handles time-varying source counts and preserves source identities across frames.
Findings
Reduces identity switches in multi-source tracking
Maintains high frame-wise localization accuracy
Effective on simulated reverberant moving sources
Abstract
Recent data- and learning-based sound source localization (SSL) methods have shown strong performance in challenging acoustic scenarios. However, little work has been done on adapting such methods to consistently track multiple sources that appear and disappear, as would occur in reality. In this paper, we present a new training strategy for deep learning SSL models with a straightforward implementation based on the mean squared error of the optimal association between estimated and reference positions in the preceding time frames. It optimizes the desired properties of a tracking system: handling a time-varying number of sources and ordering localization estimates according to their trajectories, minimizing identity switches (IDSs). Evaluation on simulated data of multiple reverberant moving sources and on two model architectures demonstrates its effectiveness in reducing identity switches…
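The loss described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sliding_pit_mse`, the `window` parameter, the fixed number of output tracks, and the brute-force search over permutations are all assumptions made for illustration, and the sketch leaves out how source activity (appearing and disappearing sources) is handled. For each frame, it picks the association between estimated and reference tracks that minimizes the mean squared error over the preceding frames in the window.

```python
from itertools import permutations

import numpy as np


def sliding_pit_mse(est, ref, window):
    """Hypothetical sliding permutation-invariant MSE.

    est, ref: arrays of shape (T, N, D) with T frames, N tracks, D coordinates.
    window:   number of frames (current frame included) used to choose
              the association at each frame. Assumed parameter.
    """
    n_frames, n_tracks, _ = est.shape
    perms = [list(p) for p in permutations(range(n_tracks))]
    # Per-frame MSE for every candidate track permutation: shape (P, T).
    frame_err = np.stack(
        [((est[:, p, :] - ref) ** 2).mean(axis=(1, 2)) for p in perms]
    )
    losses = []
    for t in range(n_frames):
        start = max(0, t - window + 1)
        # Average each permutation's error over the window ending at t,
        # then keep the best (optimal) association for this frame.
        win_err = frame_err[:, start:t + 1].mean(axis=1)
        losses.append(win_err.min())
    return float(np.mean(losses))
```

With `window=1` this reduces to a frame-wise permutation-invariant loss, which does not penalize an estimate that swaps tracks mid-sequence. A longer window makes such a swap costly, which is the intuition behind reducing identity switches. Enumerating permutations is only practical for a small number of tracks.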
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Underwater Acoustics Research
