Sound Event Bounding Boxes
Janek Ebbers, Francois G. Germain, Gordon Wichern, Jonathan Le Roux

TL;DR
This paper introduces sound event bounding boxes (SEBBs), a prediction format for sound event detection that decouples event extent prediction from confidence, improving system performance and raising the state-of-the-art PSDS1 score from .644 to .686.
Contribution
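The paper proposes SEBBs, a prediction format that represents each sound event as a class type, an extent, and an overall confidence, and a change-detection-based algorithm that converts legacy frame-level outputs into SEBBs, improving event extent accuracy and system performance.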
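As a rough illustration of the format, the sketch below models one SEBB as a small record and applies a confidence threshold to whole boxes. The class name `SEBB`, its field names, and the helper `detect` are hypothetical and chosen for this example; they are not taken from the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SEBB:
    """One sound event prediction (field names are assumptions)."""
    event_class: str   # class type, e.g. "dog_bark"
    onset: float       # start of the event extent, in seconds
    offset: float      # end of the event extent, in seconds
    confidence: float  # single overall confidence for the whole event

    @property
    def duration(self) -> float:
        return self.offset - self.onset


def detect(candidates, threshold):
    """Keep or drop whole boxes by confidence; extents are never altered."""
    return [box for box in candidates if box.confidence >= threshold]


boxes = [
    SEBB("dog_bark", 1.2, 2.0, 0.9),
    SEBB("speech", 3.0, 7.5, 0.4),
]
kept = detect(boxes, threshold=0.5)
```

Because the threshold acts on the per-event confidence rather than on individual frames, changing it decides which events are reported but leaves their onsets and offsets untouched.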
Findings
SEBBs improve event extent prediction accuracy.
The change-detection algorithm boosts DCASE 2023 Challenge system performance.
State-of-the-art PSDS1 score increased from .644 to .686.
Abstract
Sound event detection is the task of recognizing sounds and determining their extent (onset/offset times) within an audio clip. Existing systems commonly predict sound presence confidence in short time frames. Then, thresholding produces binary frame-level presence decisions, with the extent of individual events determined by merging consecutive positive frames. In this paper, we show that frame-level thresholding degrades the prediction of the event extent by coupling it with the system's sound presence confidence. We propose to decouple the prediction of event extent and confidence by introducing SEBBs, which format each sound event prediction as a tuple of a class type, extent, and overall confidence. We also propose a change-detection-based algorithm to convert legacy frame-level outputs into SEBBs. We find the algorithm significantly improves the performance of DCASE 2023 Challenge…
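The coupling described in the abstract can be seen in a minimal sketch of the conventional pipeline: threshold frame-level scores, then merge consecutive positive frames into events. The function name `frames_to_events`, the `hop` parameter, and the toy scores are assumptions made for illustration; this is not the paper's change-detection algorithm.

```python
def frames_to_events(scores, threshold, hop=1.0):
    """Threshold frame scores and merge consecutive positive frames.

    `hop` is the assumed frame duration in seconds.
    Returns a list of (onset, offset) tuples in seconds.
    """
    events = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i  # first positive frame opens an event
        elif score < threshold and start is not None:
            events.append((start * hop, i * hop))  # first negative frame closes it
            start = None
    if start is not None:
        # Event still open at the end of the clip
        events.append((start * hop, len(scores) * hop))
    return events


# Toy presence scores for a single event that fades in and out
scores = [0.1, 0.4, 0.7, 0.9, 0.8, 0.6, 0.3, 0.1]

low = frames_to_events(scores, threshold=0.3)    # permissive threshold
high = frames_to_events(scores, threshold=0.75)  # strict threshold
```

With these toy scores, the permissive threshold yields an event spanning 1.0 to 7.0 s and the strict one an event spanning 3.0 to 5.0 s: the same underlying sound gets a different extent depending on the confidence threshold, which is the coupling SEBBs are meant to remove.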
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Time Series Analysis and Forecasting
