Shared Representation Learning for Reference-Guided Targeted Sound Detection

Shubham Gupta; Adarsh Arigala; B. R. Dilleswari; Sri Rama Murty Kodukula

arXiv:2603.17025·eess.AS·March 19, 2026

Open Access

TL;DR

This paper introduces a shared encoder architecture for targeted sound detection that improves generalization and simplifies the model, achieving state-of-the-art results on the URBAN-SED dataset.

Contribution

Proposes a unified shared encoder for reference-guided sound detection, enhancing generalization and reducing complexity compared to prior methods.

Findings

01

Achieves a segment-level F1 score of 83.15%

02

Attains an overall accuracy of 95.17%

03

Sets a new state-of-the-art benchmark on URBAN-SED
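For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of how a segment-level F1 score and accuracy can be computed from binary per-segment labels. The function names and the input format (one 0/1 label per fixed-length segment) are assumptions made for illustration; the paper's exact evaluation protocol is not given on this page.

```python
def segment_f1(reference, estimate):
    """F1 over binary per-segment labels (1 = target sound active).

    Assumed input: two equal-length lists with one 0/1 label per segment.
    """
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(reference, estimate))
    tp = sum(1 for r, e in pairs if r == 1 and e == 1)  # true positives
    fp = sum(1 for r, e in pairs if r == 0 and e == 1)  # false positives
    fn = sum(1 for r, e in pairs if r == 1 and e == 0)  # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def segment_accuracy(reference, estimate):
    """Fraction of segments whose predicted label matches the reference."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for r, e in zip(reference, estimate) if r == e)
    return correct / len(reference)


if __name__ == "__main__":
    ref = [1, 1, 0, 0, 1, 0]
    est = [1, 0, 0, 1, 1, 0]
    print(segment_f1(ref, est), segment_accuracy(ref, est))
```

Accuracy counts correctly rejected segments as well as correctly detected ones, while F1 considers only the active class, which is one reason the two figures can differ substantially on the same predictions.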

Abstract

Human listeners exhibit the remarkable ability to segregate a desired sound from complex acoustic scenes through selective auditory attention, motivating the study of Targeted Sound Detection (TSD). The task requires detecting and localizing a target sound in a mixture when a reference audio of that sound is provided. Prior approaches rely on generating a sound-discriminative conditional embedding vector for the reference and pairing it with a mixture encoder, with the two jointly optimized through multi-task learning. In this work, we propose a unified encoder architecture that processes both the reference and mixture audio within a shared representation space, promoting stronger alignment while reducing architectural complexity. This design choice not only simplifies the overall framework but also enhances generalization to unseen classes. Following the multi-task training paradigm, our…
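The shared-encoder idea described in the abstract can be illustrated with a toy sketch: one set of encoder weights embeds both the reference frames and the mixture frames, so the two live in the same representation space. Everything below is an assumption made for illustration, not the authors' model: the single linear-plus-tanh layer, the mean pooling of the reference, the cosine-similarity scoring, the threshold, and all function names are hypothetical stand-ins, and the multi-task training is not modeled.

```python
import math
import random


def make_encoder(in_dim, out_dim, seed=0):
    """Random weights for a toy one-layer encoder (hypothetical stand-in)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]


def encode(weights, frames):
    """Apply the SAME weights to every frame: linear map followed by tanh."""
    return [[math.tanh(sum(w * x for w, x in zip(row, frame)))
             for row in weights]
            for frame in frames]


def mean_pool(embeddings):
    """Average frame embeddings into a single reference vector."""
    n = len(embeddings)
    return [sum(col) / n for col in zip(*embeddings)]


def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def detect(weights, reference_frames, mixture_frames, threshold=0.5):
    """Score each mixture frame against the pooled reference embedding.

    Both inputs pass through the same encoder weights (the shared part).
    Cosine scoring and the threshold are illustrative assumptions.
    """
    ref_vec = mean_pool(encode(weights, reference_frames))
    mix_embs = encode(weights, mixture_frames)
    scores = [cosine(ref_vec, emb) for emb in mix_embs]
    decisions = [1 if s >= threshold else 0 for s in scores]
    return decisions, scores


if __name__ == "__main__":
    weights = make_encoder(in_dim=4, out_dim=8)
    reference = [[1.0, 0.0, 0.0, 0.0] for _ in range(3)]
    mixture = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
    print(detect(weights, reference, mixture))
```

Because a single encoder handles both inputs, there is no separate conditional-embedding network to keep aligned with the mixture encoder, which is the simplification the abstract points to.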

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation