Transferring Voice Knowledge for Acoustic Event Detection: An Empirical Study

Dawei Liang; Yangyang Shi; Yun Wang; Nayan Singhal; Alex Xiao; Jonathan Shaw; Edison Thomaz; Ozlem Kalinli; Mike Seltzer

arXiv:2110.03174·cs.SD·October 8, 2021·1 cites

Open Access

TL;DR

This study explores whether voice representations extracted from a public speaker dataset can improve acoustic event detection (AED). It develops a dual-branch neural network for the joint learning of voice and acoustic features and reports that joint learning improves AED performance on AudioSet.

Contribution

The paper introduces a dual-branch neural network architecture that jointly learns voice and acoustic features, and it shows empirically that transferring voice knowledge improves AED performance.
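
To make the dual-branch idea concrete, here is a minimal pure-Python sketch of a forward pass with one branch for acoustic features and one for a voice embedding. Everything specific in it is an assumption for illustration: the class name `DualBranchSketch`, the layer sizes, the single linear layer per branch, fusion by concatenation, and the per-class sigmoid outputs are not taken from the paper, whose actual architecture is not described on this page.

```python
import math
import random


def linear(x, weights, bias):
    """Apply a dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def sigmoid(x):
    return [1.0 / (1.0 + math.exp(-v)) for v in x]


def init_layer(rng, n_out, n_in):
    """Small random weights and zero biases (untrained, illustration only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class DualBranchSketch:
    """Hypothetical two-branch model: acoustic branch + voice branch + shared head."""

    def __init__(self, audio_dim, voice_dim, hidden_dim, num_classes, seed=0):
        rng = random.Random(seed)
        self.audio_layer = init_layer(rng, hidden_dim, audio_dim)
        self.voice_layer = init_layer(rng, hidden_dim, voice_dim)
        # The head sees both branch outputs side by side (assumed fusion scheme).
        self.head = init_layer(rng, num_classes, 2 * hidden_dim)

    def forward(self, audio_features, voice_embedding):
        audio_hidden = relu(linear(audio_features, *self.audio_layer))
        voice_hidden = relu(linear(voice_embedding, *self.voice_layer))
        fused = audio_hidden + voice_hidden  # list concatenation, not addition
        # Independent sigmoid per class, since a clip can carry several event labels.
        return sigmoid(linear(fused, *self.head))


model = DualBranchSketch(audio_dim=4, voice_dim=3, hidden_dim=5, num_classes=2)
probabilities = model.forward([0.1, 0.2, 0.3, 0.4], [1.0, 0.0, -1.0])
```

In a joint-learning setup of this kind, both branches would be trained together against the event labels so that the head can weigh voice cues alongside acoustic ones. The weights above are random and untrained, so the outputs carry no meaning beyond showing the data flow.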

Findings

1) Joint learning improves AED mean average precision.

2) Augmenting voice features significantly boosts model performance.

3) Empirical results on AudioSet validate the approach.
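
For readers unfamiliar with the metric in the first finding, mean average precision (mAP) for a multi-label task can be sketched as follows. This is a minimal pure-Python illustration, not the paper's evaluation code: the function names and the toy labels and scores are hypothetical, and skipping classes with no positive examples is an assumed convention.

```python
def average_precision(labels, scores):
    """AP for one class: mean of the precision measured at each positive's rank."""
    # Rank clips from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precisions.append(hits / rank)
    # AP is undefined for a class with no positives; signal that with None.
    return sum(precisions) / hits if hits else None


def mean_average_precision(label_matrix, score_matrix):
    """Average the per-class AP over classes; rows are clips, columns are classes."""
    num_classes = len(label_matrix[0])
    per_class = []
    for c in range(num_classes):
        ap = average_precision(
            [row[c] for row in label_matrix],
            [row[c] for row in score_matrix],
        )
        if ap is not None:  # skip classes without positives (assumed convention)
            per_class.append(ap)
    return sum(per_class) / len(per_class)


# Toy data: 4 clips, 2 event classes (values are made up for illustration).
toy_labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_scores = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.6], [0.1, 0.4]]
toy_map = mean_average_precision(toy_labels, toy_scores)
```

Because AP depends only on how each class ranks its positive clips above its negatives, mAP rewards a model that orders clips well for every class, including rare ones, rather than one that is accurate only on frequent events.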

Abstract

Detection of common events and scenes from audio is useful for extracting and understanding human contexts in daily life. Prior studies have shown that leveraging knowledge from a relevant domain is beneficial for a target acoustic event detection (AED) process. Inspired by the observation that many human-centered acoustic events in daily life involve voice elements, this paper investigates the potential of transferring high-level voice representations extracted from a public speaker dataset to enrich an AED pipeline. Towards this end, we develop a dual-branch neural network architecture for the joint learning of voice and acoustic features during an AED process and conduct thorough empirical studies to examine the performance on the public AudioSet [1] with different types of inputs. Our main observations are that: 1) Joint learning of audio and voice inputs improves the AED…

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis