Acoustic Scene Classification Using Fusion of Attentive Convolutional Neural Networks for DCASE2019 Challenge
Hossein Zeinali, Lukáš Burget, Jan "Honza" Černocký

TL;DR
This paper presents a fusion of three attentive CNN architectures for acoustic scene classification, demonstrating improved performance on the DCASE2019 challenge dataset through multi-model fusion and self-attention mechanisms.
Contribution
It introduces a novel fusion approach combining VGG-like, Light-CNN, and x-vector CNNs with self-attention for enhanced acoustic scene classification.
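As an illustration of what fusing several networks can mean in practice, here is a minimal sketch of score-level fusion. The summary does not say which fusion strategies the authors used, so the weighted average of per-model log-probabilities below is an assumption, and the function name `fuse_scores` and its arguments are hypothetical.

```python
import numpy as np


def fuse_scores(model_log_probs, weights=None):
    """Fuse per-model class scores by a weighted average (illustrative assumption).

    model_log_probs: array-like of shape (n_models, n_clips, n_classes)
                     holding log-probabilities from each network.
    weights:         optional array-like of shape (n_models,); uniform if omitted.
    Returns the fused scores (n_clips, n_classes) and the predicted class indices.
    """
    scores = np.asarray(model_log_probs, dtype=float)
    n_models = scores.shape[0]
    if weights is None:
        # Equal weight for every network.
        weights = np.full(n_models, 1.0 / n_models)
    weights = np.asarray(weights, dtype=float)
    # Contract the model axis: (n_models,) x (n_models, n_clips, n_classes).
    fused = np.tensordot(weights, scores, axes=1)
    return fused, fused.argmax(axis=-1)
```

With uniform weights this reduces to a plain average over the networks; learned weights (for example from a held-out fold) would be passed through `weights`.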
Findings
Fusion of multiple CNNs improves classification accuracy.
Self-attention mechanisms enhance feature pooling.
The approach achieves competitive results in the DCASE2019 challenge.
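To make the pooling finding concrete, the following is a minimal sketch of attention-weighted statistics pooling over time. The exact attention form used in the paper is not given in this summary; the single-head scoring function, the parameter names `w` and `v`, and the function name are assumptions for illustration only.

```python
import numpy as np


def attentive_stats_pooling(frames, w, v):
    """Pool frame-level features into one vector using attention weights.

    frames: (T, D) frame-level features from the convolutional layers.
    w:      (D, H) projection matrix (hypothetical attention parameter).
    v:      (H,)   scoring vector (hypothetical attention parameter).
    Returns a (2 * D,) vector: attention-weighted mean and standard deviation.
    """
    hidden = np.tanh(frames @ w)             # (T, H)
    logits = hidden @ v                      # (T,) one score per frame
    logits = logits - logits.max()           # stabilise the softmax
    alpha = np.exp(logits)
    alpha = alpha / alpha.sum()              # attention weights, sum to 1
    mean = alpha @ frames                    # (D,) weighted mean
    var = alpha @ (frames ** 2) - mean ** 2  # (D,) weighted variance
    std = np.sqrt(np.clip(var, 1e-8, None))  # guard against tiny negatives
    return np.concatenate([mean, std])
```

When all frame scores are equal, the weights become uniform and the result matches ordinary mean and standard deviation pooling, which is the baseline that attention is meant to improve on.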
Abstract
This report describes the Brno University of Technology (BUT) team submissions for Task 1 (Acoustic Scene Classification, ASC) of the DCASE-2019 challenge and provides an analysis of the different methods. The proposed approach is a fusion of three different Convolutional Neural Network (CNN) topologies. The first is a VGG-like two-dimensional CNN. The second is also a two-dimensional CNN; it uses the Max-Feature-Map activation and is called Light-CNN (LCNN). The third is a one-dimensional CNN known as the x-vector topology, which is mainly used for speaker verification. All proposed networks use a self-attention mechanism for statistics pooling. As features, we use 256-dimensional log Mel-spectrograms. Our submissions are fusions of several networks trained on a generated 4-fold evaluation setup, using different fusion strategies.
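For readers unfamiliar with the Max-Feature-Map activation mentioned in the abstract, here is a minimal sketch of its commonly described form: the channels are split into two halves and the element-wise maximum is kept, halving the channel count. The channel-first layout and the function name are assumptions; the authors' implementation may differ.

```python
import numpy as np


def max_feature_map(x):
    """Max-Feature-Map over the channel axis.

    x: (C, H, W) feature maps with an even number of channels C
       (channel-first layout is an assumption of this sketch).
    Returns (C // 2, H, W): the element-wise maximum of the two channel halves.
    """
    channels = x.shape[0]
    if channels % 2 != 0:
        raise ValueError("Max-Feature-Map needs an even number of channels")
    half = channels // 2
    first, second = x[:half], x[half:]
    return np.maximum(first, second)
```

Unlike ReLU, which thresholds each channel independently, this activation selects between paired channels, so it acts as a competitive feature selector while also reducing the number of feature maps.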
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
