Cross-scale Attention Model for Acoustic Event Classification
Xugang Lu, Peng Shen, Sheng Li, Yu Tsao, Hisashi Kawai

TL;DR
This paper introduces a cross-scale attention model for acoustic event classification that combines features from different scales using attention mechanisms, improving the detection of both short- and long-duration sounds.
Contribution
The paper proposes a novel cross-scale attention model that explicitly integrates multi-scale features with attention weighting, enhancing acoustic event classification performance.
Findings
Improved classification accuracy on an urban AEC dataset
Enhanced detection of short- and long-duration acoustic events
Model outperforms existing state-of-the-art methods
Abstract
A major advantage of a deep convolutional neural network (CNN) is that the focused receptive field size is increased by stacking multiple convolutional layers. Accordingly, the model can explore the long-range dependency of features from the top layers. However, a potential limitation of the network is that the discriminative features from the bottom layers (which can model the short-range dependency) are smoothed out in the final representation. This limitation is especially evident in the acoustic event classification (AEC) task, where both short- and long-duration events occur in an audio clip and need to be classified. In this paper, we propose a cross-scale attention (CSA) model, which explicitly integrates features from different scales to form the final representation. Moreover, we propose the adoption of the attention mechanism to specify the weights of local and global…
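The abstract is cut off before it describes how the attention weights are computed, so the following is only a minimal sketch of the general idea: one feature vector per scale, a softmax over per-scale scores, and a weighted sum as the final representation. The function names `softmax` and `fuse_scales`, the dot-product scoring, and the `scoring_vector` parameter are all assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scales(scale_features, scoring_vector):
    """Fuse per-scale feature vectors with attention weights.

    scale_features: list of equal-length vectors, one per scale
        (assumed to be already pooled over time; hypothetical setup).
    scoring_vector: stand-in for a learned parameter; the score of a
        scale is its dot product with this vector (an assumption).
    """
    scores = [
        sum(f * w for f, w in zip(feat, scoring_vector))
        for feat in scale_features
    ]
    weights = softmax(scores)  # one weight per scale, summing to 1
    dim = len(scoring_vector)
    fused = [
        sum(weights[k] * scale_features[k][i] for k in range(len(scale_features)))
        for i in range(dim)
    ]
    return fused, weights


# Toy example: three scales, from local (bottom layer) to global (top layer).
local_feat = [1.0, 0.0]
mid_feat = [0.5, 0.5]
global_feat = [0.0, 1.0]
fused, weights = fuse_scales([local_feat, mid_feat, global_feat], [1.0, -1.0])
```

In this toy setup the scoring vector favours the local scale, so the local features receive the largest weight and dominate the fused vector. With a different scoring vector the global scale would dominate instead, which is the sense in which attention can keep short-range features from being smoothed out.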
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
