Towards Open World Sound Event Detection

P.H.Hai; L.T.Minh; L.H.Son

arXiv:2605.03934 · cs.SD · May 22, 2026

TL;DR

This paper introduces OW-SED, a paradigm for sound event detection in which a model detects known acoustic events, identifies unknown ones, and incrementally learns them, using a deformable transformer architecture.

Contribution

It proposes WOOT, a framework for open-world sound event detection that combines a 1D deformable attention architecture with a disentangled representation approach.

Findings

01

Performs marginally better than baselines in closed-world settings.

02

Significantly outperforms baselines in open-world scenarios.

03

Demonstrates effectiveness of deformable attention and feature disentanglement.

Abstract

Sound Event Detection (SED) plays a vital role in audio understanding, with applications in surveillance, smart cities, healthcare, and multimedia indexing. However, conventional SED systems operate under a closed-world assumption, limiting their effectiveness in real-world environments where novel acoustic events frequently emerge. Inspired by the success of open-world learning in computer vision, we introduce the Open-World Sound Event Detection (OW-SED) paradigm, where models must detect known events, identify unseen ones, and incrementally learn from them. To tackle the unique challenges of OW-SED, such as overlapping and ambiguous events, we propose a 1D Deformable architecture that leverages deformable attention to adaptively focus on salient temporal regions. Furthermore, we design a novel Open-World Deformable Sound Event Detection Transformer (WOOT) framework incorporating…
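To make the idea of deformable attention over time concrete, here is a minimal NumPy sketch of sampling a few frames around a reference time step and mixing them with softmax weights. It is an illustration under stated assumptions, not the paper's implementation: the function names `linear_sample` and `deformable_attention_1d` are hypothetical, and the offsets and attention logits are passed in directly, whereas a real model would typically predict them from the query.

```python
import numpy as np


def linear_sample(features, positions):
    """Linearly interpolate features of shape (T, D) at fractional positions (K,)."""
    num_frames = features.shape[0]
    # Keep sampling positions inside the valid time range.
    pos = np.clip(positions, 0.0, num_frames - 1.0)
    left = np.floor(pos).astype(int)
    right = np.minimum(left + 1, num_frames - 1)
    frac = (pos - left)[:, None]
    return (1.0 - frac) * features[left] + frac * features[right]


def deformable_attention_1d(features, reference, offsets, logits):
    """Aggregate K sampled frames around one reference time step.

    features:  (T, D) frame-level features
    reference: float, reference position on the time axis
    offsets:   (K,) sampling offsets (assumed given here, normally learned)
    logits:    (K,) unnormalized attention scores (assumed given here)
    """
    # Softmax over the K sampling points.
    weights = np.exp(logits - logits.max())
    weights = weights / weights.sum()
    sampled = linear_sample(features, reference + offsets)  # (K, D)
    return weights @ sampled  # (D,)


# Toy input: 10 frames with one feature whose value equals the frame index.
features = np.arange(10, dtype=float)[:, None]
out = deformable_attention_1d(
    features,
    reference=4.0,
    offsets=np.array([-1.5, 0.0, 2.25]),
    logits=np.zeros(3),
)
print(out)
```

With equal logits, the output is the plain average of the features at positions 2.5, 4.0, and 6.25. The point of the sketch is that attention is computed over a handful of adaptively placed sampling points rather than over every frame.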
