ToS: A Team of Specialists ensemble framework for Stereo Sound Event Localization and Detection with distance estimation in Video
Davide Berghi, Philip J. B. Jackson

TL;DR
The paper introduces ToS, an ensemble framework that combines three specialized sub-networks for sound event localization and detection with distance estimation (3D SELD) in video. It outperforms state-of-the-art audio-visual models on the DCASE2025 Task 3 Stereo SELD development set.
Contribution
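It proposes an ensemble of three complementary sub-networks (spatio-linguistic, spatio-temporal, and tempo-linguistic), each specializing in a different pair of the semantic, spatial, and temporal dimensions, to improve multimodal 3D SELD performance.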
Findings
Outperforms state-of-the-art audio-visual 3D SELD models on the DCASE2025 Task 3 Stereo SELD development set
Demonstrates the effectiveness of specialized sub-networks in multimodal tasks
Provides a flexible ensemble framework for sound event localization and detection
Abstract
Sound event localization and detection with distance estimation (3D SELD) in video involves identifying active sound events at each time frame while estimating their spatial coordinates. This multimodal task requires joint reasoning across semantic, spatial, and temporal dimensions, a challenge that single models often struggle to address effectively. To tackle this, we introduce the Team of Specialists (ToS) ensemble framework, which integrates three complementary sub-networks: a spatio-linguistic model, a spatio-temporal model, and a tempo-linguistic model. Each sub-network specializes in a unique pair of dimensions, contributing distinct insights to the final prediction, akin to a collaborative team with diverse expertise. ToS has been benchmarked against state-of-the-art audio-visual models for 3D SELD on the DCASE2025 Task 3 Stereo SELD development set, consistently outperforming…
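To make the ensemble idea concrete, the following is a minimal sketch of how the outputs of three specialists could be merged into one per-frame prediction. The abstract does not say how ToS combines its sub-networks, so the fusion rule here (averaging activity probabilities, then activity-weighted averaging of coordinates) is an assumption chosen for illustration, not the authors' method. The names `Prediction`, `fuse`, the `threshold` parameter, and the `(azimuth_deg, distance_m)` coordinate layout are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Prediction:
    """One specialist's output for a single (frame, class) pair (hypothetical layout)."""
    activity: float            # probability that the class is active, in [0, 1]
    coords: Tuple[float, ...]  # assumed spatial coordinates, e.g. (azimuth_deg, distance_m)


# frame index -> class name -> Prediction
FramePreds = Dict[int, Dict[str, Prediction]]


def fuse(specialists: List[FramePreds], threshold: float = 0.5) -> FramePreds:
    """Assumed fusion rule: mean activity across all specialists, then an
    activity-weighted mean of the coordinates for events that pass the threshold."""
    fused: FramePreds = {}
    frames = set()
    for s in specialists:
        frames.update(s)
    for t in sorted(frames):
        classes = set()
        for s in specialists:
            classes.update(s.get(t, {}))
        for c in sorted(classes):
            # Specialists that did not report this class count as activity 0.
            votes = [s[t][c] for s in specialists if c in s.get(t, {})]
            total = sum(v.activity for v in votes)
            mean_activity = total / len(specialists)
            if mean_activity < threshold or total == 0:
                continue  # treat the event as inactive in this frame
            dims = len(votes[0].coords)
            coords = tuple(
                sum(v.activity * v.coords[i] for v in votes) / total
                for i in range(dims)
            )
            fused.setdefault(t, {})[c] = Prediction(mean_activity, coords)
    return fused


# Toy outputs standing in for the three sub-networks.
spatio_linguistic = {0: {"speech": Prediction(0.9, (10.0, 2.0)),
                         "door": Prediction(0.9, (-30.0, 4.0))}}
spatio_temporal = {0: {"speech": Prediction(0.6, (20.0, 2.0))}}
tempo_linguistic = {0: {"speech": Prediction(0.6, (20.0, 2.0))}}

result = fuse([spatio_linguistic, spatio_temporal, tempo_linguistic])
```

In this toy case "speech" is kept because all three specialists agree it is active, while "door" is dropped because only one specialist reports it. A learned fusion layer or per-dimension weighting would be equally plausible alternatives.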
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Multimodal Machine Learning Applications
