Detect Any Sound: Open-Vocabulary Sound Event Detection with Multi-Modal Queries
Pengfei Cai, Yan Song, Qing Gu, Nan Jiang, Haoyu Song, Ian McLoughlin

TL;DR
This paper introduces DASM, a novel query-based framework for open-vocabulary sound event detection that uses multi-modal queries and a dual-stream decoder to improve detection and generalization to unseen classes.
Contribution
DASM formulates SED as a frame-level retrieval task with a dual-stream decoder and an inference-time attention masking strategy, enabling effective open-vocabulary detection and cross-modal feature fusion.
Findings
Outperforms CLAP-based methods in the open-vocabulary setting (+7.8 PSDS)
Surpasses the baseline in the closed-set setting (+6.9 PSDS)
Achieves a PSDS1 score of 42.2 in cross-dataset zero-shot evaluation
Abstract
Most existing sound event detection (SED) algorithms operate under a closed-set assumption, restricting their detection capabilities to predefined classes. While recent efforts have explored language-driven zero-shot SED by exploiting audio-language models, their performance is still far from satisfactory due to the lack of fine-grained alignment and cross-modal feature fusion. In this work, we propose the Detect Any Sound Model (DASM), a query-based framework for open-vocabulary SED guided by multi-modal queries. DASM formulates SED as a frame-level retrieval task, where audio features are matched against query vectors derived from text or audio prompts. To support this formulation, DASM introduces a dual-stream decoder that explicitly decouples event recognition and temporal localization: a cross-modality event decoder performs query-feature fusion and determines the presence of sound…
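The frame-level retrieval idea in the abstract can be illustrated with a minimal sketch. It is not DASM's dual-stream decoder: it assumes frame and query embeddings already live in a shared space and scores them with plain cosine similarity. The function names, the temperature, and the threshold are hypothetical choices made for illustration only.

```python
import numpy as np


def frame_level_retrieval(frames, queries, temperature=0.1, threshold=0.6):
    """Score every audio frame against every query vector.

    frames:  (T, D) frame embeddings (assumed output of some audio encoder).
    queries: (Q, D) query embeddings from text or audio prompts (assumed).
    Returns (scores, active): (T, Q) probabilities and a boolean mask.
    """
    # L2-normalise so the dot product is a cosine similarity.
    f = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    sim = f @ q.T  # shape (T, Q)
    # Sigmoid rather than softmax: several events may overlap in one frame.
    scores = 1.0 / (1.0 + np.exp(-sim / temperature))
    return scores, scores >= threshold


def active_segments(active_column):
    """Convert per-frame activity for one query into (onset, offset) pairs.

    Indices are frame indices; the offset is exclusive.
    """
    padded = np.concatenate(([False], active_column, [False])).astype(int)
    diff = np.diff(padded)
    onsets = np.flatnonzero(diff == 1)    # inactive -> active transitions
    offsets = np.flatnonzero(diff == -1)  # active -> inactive transitions
    return list(zip(onsets.tolist(), offsets.tolist()))


if __name__ == "__main__":
    # Toy data: four frames, two queries, two-dimensional embeddings.
    frames = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
    queries = np.array([[1.0, 0.0], [0.0, 1.0]])
    scores, active = frame_level_retrieval(frames, queries)
    for k in range(queries.shape[0]):
        print(f"query {k}:", active_segments(active[:, k]))
```

Because each query is scored independently, adding a query for an unseen class only adds a column to the score matrix. That property is what makes a retrieval formulation a natural fit for open-vocabulary detection.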
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Natural Language Processing Techniques
