Transferring Voice Knowledge for Acoustic Event Detection: An Empirical Study
Dawei Liang, Yangyang Shi, Yun Wang, Nayan Singhal, Alex Xiao, Jonathan Shaw, Edison Thomaz, Ozlem Kalinli, Mike Seltzer

TL;DR
This study explores transferring voice representations to improve acoustic event detection by developing a dual-branch neural network, showing that joint learning enhances detection accuracy on AudioSet.
Contribution
It introduces a dual-branch neural network architecture for joint learning of voice and acoustic features, demonstrating improved AED performance through transfer learning.
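As a rough illustration of what a dual-branch model can look like, the sketch below runs a forward pass in which one branch processes acoustic features, the other processes a precomputed voice embedding, and the two hidden vectors are concatenated before a multi-label sigmoid classifier. Everything here is an assumption made for illustration: the summary does not specify the layer types, the sizes, or how the branches are fused, and the names `DualBranchSketch`, `init_layer`, `acoustic_dim`, and `voice_dim` are hypothetical. It uses plain Python lists and untrained random weights, so it shows only the data flow, not the paper's actual network.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output value per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def sigmoid(x):
    return [1.0 / (1.0 + math.exp(-v)) for v in x]


def init_layer(rng, n_out, n_in):
    """Small random weights and zero biases (hypothetical initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class DualBranchSketch:
    """Illustrative two-branch model; fusion by concatenation is an assumption."""

    def __init__(self, acoustic_dim, voice_dim, hidden_dim, num_classes, seed=0):
        rng = random.Random(seed)
        self.acoustic = init_layer(rng, hidden_dim, acoustic_dim)
        self.voice = init_layer(rng, hidden_dim, voice_dim)
        # The classifier head sees both branch outputs side by side.
        self.head = init_layer(rng, num_classes, 2 * hidden_dim)

    def forward(self, acoustic_features, voice_embedding):
        a = relu(linear(acoustic_features, *self.acoustic))
        v = relu(linear(voice_embedding, *self.voice))
        fused = a + v  # list concatenation, not element-wise addition
        # Independent sigmoid per class, since several events can co-occur.
        return sigmoid(linear(fused, *self.head))


model = DualBranchSketch(acoustic_dim=8, voice_dim=4, hidden_dim=6, num_classes=3)
scores = model.forward([0.5] * 8, [0.1] * 4)
print(len(scores))  # one score per event class
```

Concatenation is only one plausible way to combine the branches; it is used here because it is the simplest to read.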
Findings
Joint learning improves AED mean average precision.
Augmenting voice features significantly boosts model performance.
Empirical results on AudioSet validate the approach.
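For readers unfamiliar with the metric named in the findings, the following is a minimal sketch of mean average precision (mAP) for multi-label detection. It assumes the common definition: for each class, clips are ranked by predicted score, precision is averaged over the ranks at which positives occur, and the per-class values are then averaged. The function names and the toy labels and scores are hypothetical, and the paper's exact evaluation protocol may differ in details such as tie handling.

```python
def average_precision(labels, scores):
    """AP for one class: mean precision at the rank of each positive clip."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    # A class with no positive clips has no defined AP.
    return sum(precisions) / len(precisions) if precisions else None


def mean_average_precision(label_matrix, score_matrix):
    """Average AP over classes; rows are clips, columns are classes."""
    num_classes = len(label_matrix[0])
    aps = []
    for c in range(num_classes):
        ap = average_precision(
            [row[c] for row in label_matrix],
            [row[c] for row in score_matrix],
        )
        if ap is not None:  # skip classes without positives
            aps.append(ap)
    if not aps:
        raise ValueError("no class has a positive example")
    return sum(aps) / len(aps)


# Toy data: 3 clips, 2 classes (made up for illustration).
toy_labels = [[1, 0], [0, 1], [1, 1]]
toy_scores = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.6]]
toy_map = mean_average_precision(toy_labels, toy_scores)
print(round(toy_map, 4))
```

In the toy data, the first class has its positives at ranks 1 and 3, giving an AP of (1 + 2/3) / 2, while the second class ranks both positives first and has an AP of 1.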
Abstract
Detection of common events and scenes from audio is useful for extracting and understanding human contexts in daily life. Prior studies have shown that leveraging knowledge from a relevant domain is beneficial for a target acoustic event detection (AED) process. Inspired by the observation that many human-centered acoustic events in daily life involve voice elements, this paper investigates the potential of transferring high-level voice representations extracted from a public speaker dataset to enrich an AED pipeline. Towards this end, we develop a dual-branch neural network architecture for the joint learning of voice and acoustic features during an AED process and conduct thorough empirical studies to examine the performance on the public AudioSet [1] with different types of inputs. Our main observations are that: 1) Joint learning of audio and voice inputs improves the AED…
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
