Differentiable Tracking-Based Training of Deep Learning Sound Source Localizers
Sharath Adavanne, Archontis Politis, Tuomas Virtanen

TL;DR
This paper introduces a differentiable training method for deep learning sound source localizers that directly optimizes tracking metrics, improving multi-source localization and tracking without auxiliary information.
Contribution
It adapts a differentiable network approach from video object detection to sound source localization, enabling end-to-end training for multi-source scenarios.
Findings
Significant reduction in localization error.
Improved detection and tracking metrics.
Enhanced multi-source tracking capabilities.
Abstract
Data-based and learning-based sound source localization (SSL) has shown promising results in challenging conditions, and is commonly posed as a classification or a regression problem. Regression-based approaches have certain advantages over classification-based ones, such as continuous direction-of-arrival estimation of static and moving sources. However, multi-source scenarios require multiple regressors, and to date there is no clear training strategy for them that does not rely on auxiliary information such as simultaneous sound classification. We investigate end-to-end training of such methods with a technique recently proposed for video object detectors, adapted to the SSL setting. A differentiable network is constructed that can be plugged into the output of the localizer to solve the optimal assignment between predictions and references, directly optimizing the popular CLEAR-MOT tracking metrics.…
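To make the assignment step in the abstract concrete, here is a minimal sketch of matching predicted directions of arrival (DOAs) to reference DOAs within one frame and counting the quantities that CLEAR-MOT-style metrics are built from (matches, misses, false positives, mean localization error). It is not the paper's differentiable network: it uses a plain brute-force search over one-to-one matchings, which is exact but not differentiable and only practical for a handful of sources. The function names, the (azimuth, elevation) representation in degrees, and the 20-degree match threshold are assumptions made for illustration. Identity switches across frames are not tracked here.

```python
import math
from itertools import permutations


def angular_distance(a, b):
    """Great-circle angle in degrees between two (azimuth, elevation) DOAs in degrees."""
    az1, el1 = (math.radians(v) for v in a)
    az2, el2 = (math.radians(v) for v in b)
    cos_angle = (math.sin(el1) * math.sin(el2)
                 + math.cos(el1) * math.cos(el2) * math.cos(az1 - az2))
    # Clamp to [-1, 1] to guard against floating-point overshoot before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def optimal_assignment(preds, refs):
    """Exhaustive minimum-cost one-to-one matching.

    Returns a list of (pred_index, ref_index) pairs of length
    min(len(preds), len(refs)). Brute force stands in for the
    Hungarian algorithm; it is exact but factorial in cost.
    """
    if len(preds) <= len(refs):
        best = min(
            permutations(range(len(refs)), len(preds)),
            key=lambda p: sum(angular_distance(preds[i], refs[j])
                              for i, j in enumerate(p)),
        )
        return [(i, j) for i, j in enumerate(best)]
    best = min(
        permutations(range(len(preds)), len(refs)),
        key=lambda p: sum(angular_distance(preds[i], refs[j])
                          for j, i in enumerate(p)),
    )
    return [(i, j) for j, i in enumerate(best)]


def frame_counts(preds, refs, threshold_deg=20.0):
    """Per-frame matches, false positives, misses, and mean error of the matches.

    threshold_deg is a hypothetical gating value: an assigned pair farther
    apart than this counts as one false positive plus one miss.
    """
    pairs = optimal_assignment(preds, refs)
    errors = [angular_distance(preds[i], refs[j]) for i, j in pairs]
    matched = [e for e in errors if e <= threshold_deg]
    tp = len(matched)
    return {
        "matches": tp,
        "false_positives": len(preds) - tp,
        "misses": len(refs) - tp,
        "mean_error_deg": sum(matched) / tp if tp else None,
    }


if __name__ == "__main__":
    predictions = [(10.0, 0.0), (100.0, 0.0)]
    references = [(95.0, 0.0), (12.0, 0.0), (-90.0, 0.0)]
    print(optimal_assignment(predictions, references))
    print(frame_counts(predictions, references))
```

Summing the misses, false positives, and identity switches over all frames and dividing by the total number of reference sources gives the error term of MOTA, while averaging the matched distances gives a MOTP-style localization error. Because the hard assignment above has no useful gradient, it can only score a localizer; training through the assignment requires a differentiable approximation such as the one the abstract describes.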
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Underwater Acoustics Research
