A Squeeze-and-Excitation and Transformer based Cross-task System for Environmental Sound Recognition
Jisheng Bai, Jianfeng Chen, Mou Wang, Muhammad Saad Ayub

TL;DR
This paper introduces SE-Trans, a unified cross-task model for environmental sound recognition that leverages attention mechanisms and data augmentation to achieve state-of-the-art results across multiple ESR tasks.
Contribution
The paper proposes a novel cross-task architecture combining Squeeze-and-Excitation and Transformer modules for ESR, enabling knowledge sharing across diverse tasks.
Findings
Achieves state-of-the-art performance on multiple ESR tasks
Effectively utilizes acoustic knowledge across tasks
Improves ESR accuracy with FMix data augmentation
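The FMix augmentation named in the findings mixes two inputs through a binary mask obtained by thresholding a low-frequency random field. The following is a minimal NumPy sketch of that idea applied to two spectrogram-like arrays, not the authors' implementation. The function names `fmix_mask` and `fmix`, the `decay` and `alpha` defaults, and the input shapes are assumptions chosen for illustration.

```python
import numpy as np

def fmix_mask(shape, lam, decay=3.0, rng=None):
    """Binary mask whose fraction of ones is approximately `lam` (assumed helper)."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = shape
    # Radial frequency magnitude of every Fourier bin.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    freq = np.sqrt(fx ** 2 + fy ** 2)
    freq[0, 0] = 1.0 / max(h, w)  # avoid division by zero at the DC bin
    # Complex Gaussian noise, attenuated at high frequencies.
    noise = rng.standard_normal((h, w)) + 1j * rng.standard_normal((h, w))
    grey = np.real(np.fft.ifft2(noise / freq ** decay))
    # Binarise: the `lam` fraction of positions with the largest values become 1.
    k = int(round(lam * h * w))
    mask = np.zeros(h * w)
    order = np.argsort(grey.ravel())[::-1]
    mask[order[:k]] = 1.0
    return mask.reshape(h, w)

def fmix(x1, x2, alpha=1.0, rng=None):
    """Mix two equally shaped 2-D inputs; returns the mix and the effective ratio."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)  # mixing ratio, as in mixup-style methods
    mask = fmix_mask(x1.shape, lam, rng=rng)
    mixed = mask * x1 + (1.0 - mask) * x2
    return mixed, mask.mean()

# Usage: mix two random "log-mel" patches of 64 bins by 100 frames (assumed sizes).
rng = np.random.default_rng(0)
a = rng.standard_normal((64, 100))
b = rng.standard_normal((64, 100))
mixed, lam_eff = fmix(a, b, rng=rng)
```

The returned ratio can be used to weight the two targets or losses in the same proportion as the inputs, as is usual for mixup-style augmentation.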
Abstract
Environmental sound recognition (ESR) is an emerging research topic in audio pattern recognition. Many tasks have been proposed that rely on computational models for ESR in real-life applications. However, current models are usually designed for individual tasks, and are neither robust nor applicable to other tasks. Cross-task models, which promote unified knowledge modeling across various tasks, have not been thoroughly investigated. In this article, we propose a cross-task model for three different tasks of ESR: 1) acoustic scene classification; 2) urban sound tagging; and 3) anomalous sound detection. An architecture named SE-Trans is presented that uses attention mechanism-based Squeeze-and-Excitation and Transformer encoder modules to learn the channelwise relationship and temporal dependencies of the acoustic features. FMix is employed as the data augmentation method that improves the…
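The two attention mechanisms described in the abstract can be sketched in a few lines of NumPy: a Squeeze-and-Excitation gate that reweights channels, followed by single-head scaled dot-product self-attention over time frames. This is only an illustration of the general techniques, not the SE-Trans architecture itself. The layer sizes, the reduction ratio `r`, the random weights, and the way the feature map is flattened into a sequence are all assumptions.

```python
import numpy as np

def squeeze_excite(x, w1, b1, w2, b2):
    """Channel attention on a feature map x of shape (channels, time, freq)."""
    # Squeeze: global average over time and frequency, one value per channel.
    z = x.mean(axis=(1, 2))
    # Excitation: bottleneck layer with ReLU, then a sigmoid gate per channel.
    h = np.maximum(z @ w1 + b1, 0.0)
    s = 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))
    # Scale: multiply every channel by its gate.
    return x * s[:, None, None], s

def self_attention(seq, wq, wk, wv):
    """Single-head scaled dot-product attention on seq of shape (time, dim)."""
    q, k, v = seq @ wq, seq @ wk, seq @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
C, T, F, r, d = 8, 10, 16, 4, 16  # assumed sizes for illustration
x = rng.standard_normal((C, T, F))

# Channel-wise reweighting.
w1 = rng.standard_normal((C, C // r)) * 0.1
w2 = rng.standard_normal((C // r, C)) * 0.1
y, gates = squeeze_excite(x, w1, np.zeros(C // r), w2, np.zeros(C))

# Treat each time frame as one token of size C * F and attend across frames.
seq = y.transpose(1, 0, 2).reshape(T, C * F)
wq, wk, wv = (rng.standard_normal((C * F, d)) * 0.1 for _ in range(3))
out, attn = self_attention(seq, wq, wk, wv)
```

Here `gates` holds one weight per channel, and each row of `attn` describes how strongly one time frame attends to every other frame. A full Transformer encoder layer would add multiple heads, a position-wise feed-forward layer, residual connections, and layer normalization.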
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Diverse Musicological Studies
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Softmax · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Dropout · Layer Normalization
