MTDA-HSED: Mutual-Assistance Tuning and Dual-Branch Aggregating for Heterogeneous Sound Event Detection

Zehao Wang; Haobo Yue; Zhicheng Zhang; Da Mu; Jin Tang; Jianqin Yin

arXiv:2409.06196·cs.SD·September 12, 2024

TL;DR

This paper introduces MTDA-HSED, a dual-branch architecture with mutual assistance tuning and aggregation, improving heterogeneous sound event detection by effectively learning from complex acoustic scenes.

Contribution

It proposes a dual-branch architecture that pairs the Mutual-Assistance Audio Adapter (M3A) with the Dual-Branch Mid-Fusion (DBMF) module to improve feature learning across heterogeneous datasets.

Findings

01 Exceeds baseline mpAUC by 5% on DESED and MAESTRO datasets.

02 Effectively handles multi-scenario and multi-granularity problems.

03 Improves performance of sound event detection in heterogeneous environments.

Abstract

Sound Event Detection (SED) plays a vital role in comprehending and perceiving acoustic scenes. Previous methods have demonstrated impressive capabilities, but they struggle to learn features of complex scenes from heterogeneous datasets. In this paper, we introduce a novel dual-branch architecture named Mutual-Assistance Tuning and Dual-Branch Aggregating for Heterogeneous Sound Event Detection (MTDA-HSED). The MTDA-HSED architecture employs the Mutual-Assistance Audio Adapter (M3A) to tackle the multi-scenario problem and uses the Dual-Branch Mid-Fusion (DBMF) module to tackle the multi-granularity problem. Specifically, M3A is integrated into the BEATs block as an adapter to improve BEATs' performance by fine-tuning it on the multi-scenario dataset. The DBMF module connects the BEATs and CNN branches, which facilitates the deep fusion of information from the…
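
To make the adapter idea in the abstract concrete, here is a minimal sketch of a generic residual bottleneck adapter, the common pattern in which a small trainable module is added to a pretrained block while the backbone stays fixed. This is not the M3A design, whose internals the abstract does not describe. The function names, dimensions, and zero initialization of the up-projection are illustrative assumptions, and plain Python lists stand in for real tensors.

```python
import random

def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, a) for a in v]

def bottleneck_adapter(x, w_down, w_up):
    """Generic residual adapter (hypothetical helper, not M3A):
    returns x + W_up * relu(W_down * x)."""
    hidden = relu(matvec(w_down, x))   # project down to the bottleneck width
    delta = matvec(w_up, hidden)       # project back up to the feature width
    return [xi + di for xi, di in zip(x, delta)]

def init_adapter(dim, bottleneck, seed=0):
    """Assumed initialization: small random down-projection and a zero
    up-projection, so the adapter starts out as the identity mapping."""
    rng = random.Random(seed)
    w_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(bottleneck)]
    w_up = [[0.0] * bottleneck for _ in range(dim)]
    return w_down, w_up

# One feature frame from a (hypothetical) pretrained block.
frame = [0.5, -1.0, 2.0, 0.25]
w_down, w_up = init_adapter(dim=4, bottleneck=2)
out = bottleneck_adapter(frame, w_down, w_up)

# A hand-checkable case with non-zero weights.
small = bottleneck_adapter([1.0, 2.0], [[1.0, 1.0]], [[1.0], [-1.0]])
```

With the zero up-projection, `out` equals `frame`, so inserting the adapter leaves the pretrained block's output unchanged until fine-tuning updates the adapter weights. In the second call the bottleneck activation is 3.0, giving `[4.0, -1.0]`.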

Code & Models

Repositories

visitor-w/mtda
none · Official

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Adapter