ToS: A Team of Specialists ensemble framework for Stereo Sound Event Localization and Detection with distance estimation in Video

Davide Berghi; Philip J. B. Jackson

arXiv:2601.17611 · eess.AS · January 27, 2026

Open Access

TL;DR

The paper introduces ToS, an ensemble framework that combines three specialized sub-networks for sound event localization and detection with distance estimation (3D SELD) in video, and reports that it outperforms state-of-the-art audio-visual models on the DCASE2025 Task 3 Stereo SELD development set.

Contribution

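It proposes a novel ensemble of three complementary sub-networks (spatio-linguistic, spatio-temporal, and tempo-linguistic), each specializing in a different pair of the semantic, spatial, and temporal dimensions, to enhance multimodal 3D SELD performance.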

Findings

01 Outperforms state-of-the-art audio-visual models on the DCASE2025 Task 3 Stereo SELD development set

02 Demonstrates the effectiveness of specialized sub-networks in multimodal tasks

03 Provides a flexible ensemble framework for sound event localization and detection

Abstract

Sound event localization and detection with distance estimation (3D SELD) in video involves identifying active sound events at each time frame while estimating their spatial coordinates. This multimodal task requires joint reasoning across semantic, spatial, and temporal dimensions, a challenge that single models often struggle to address effectively. To tackle this, we introduce the Team of Specialists (ToS) ensemble framework, which integrates three complementary sub-networks: a spatio-linguistic model, a spatio-temporal model, and a tempo-linguistic model. Each sub-network specializes in a unique pair of dimensions, contributing distinct insights to the final prediction, akin to a collaborative team with diverse expertise. ToS has been benchmarked against state-of-the-art audio-visual models for 3D SELD on the DCASE2025 Task 3 Stereo SELD development set, consistently outperforming…
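
To make the ensemble idea concrete, the sketch below fuses the outputs of three specialists for one sound class at one time frame. It is a minimal illustration, not the method from the paper: the abstract does not say how ToS combines its sub-networks, so the activity-weighted averaging rule, the `Prediction` type, the `fuse` function, the 0.5 threshold, and the (azimuth, distance) parameterization of the spatial coordinates are all assumptions introduced here.

```python
import math
from dataclasses import dataclass


@dataclass
class Prediction:
    """One specialist's output for a single class at a single time frame.

    Hypothetical structure: the paper's actual output format is not given
    in the abstract.
    """
    activity: float     # probability that the class is active, in [0, 1]
    azimuth_deg: float  # estimated direction of arrival, in degrees
    distance_m: float   # estimated source distance, in metres


def fuse(predictions, threshold=0.5):
    """Fuse specialist predictions by activity-weighted averaging.

    Returns None when the mean activity is below `threshold`; otherwise
    returns an (azimuth_deg, distance_m) tuple. This fusion rule is an
    assumption made for illustration, not the rule used by ToS.
    """
    total = sum(p.activity for p in predictions)
    mean_activity = total / len(predictions)
    if total == 0 or mean_activity < threshold:
        return None

    # Average positions in Cartesian space so that direction and distance
    # are combined consistently, then convert back to polar form.
    x = sum(p.activity * p.distance_m * math.cos(math.radians(p.azimuth_deg))
            for p in predictions) / total
    y = sum(p.activity * p.distance_m * math.sin(math.radians(p.azimuth_deg))
            for p in predictions) / total
    return math.degrees(math.atan2(y, x)), math.hypot(x, y)


# Example: three specialists that roughly agree on an active source.
specialists = [
    Prediction(activity=0.9, azimuth_deg=28.0, distance_m=2.1),
    Prediction(activity=0.8, azimuth_deg=32.0, distance_m=1.9),
    Prediction(activity=0.7, azimuth_deg=30.0, distance_m=2.0),
]
print(fuse(specialists))
```

Weighting by activity lets a specialist that is confident about an event contribute more to its estimated position, which is one simple way to let each model's strengths show through; a learned fusion layer or per-dimension voting would be equally plausible alternatives.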

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Multimodal Machine Learning Applications