A Squeeze-and-Excitation and Transformer based Cross-task System for Environmental Sound Recognition

Jisheng Bai; Jianfeng Chen; Mou Wang; Muhammad Saad Ayub

arXiv:2203.08350 · eess.AS · November 22, 2023

TL;DR

This paper introduces SE-Trans, a unified cross-task model for environmental sound recognition (ESR) that combines Squeeze-and-Excitation and Transformer encoder modules with FMix data augmentation to achieve state-of-the-art results across multiple ESR tasks.

Contribution

The paper proposes a novel cross-task architecture combining Squeeze-and-Excitation and Transformer modules for ESR, enabling knowledge sharing across diverse tasks.

Findings

01 Achieves state-of-the-art performance on multiple ESR tasks

02 Effectively utilizes acoustic knowledge across tasks

03 Improves ESR accuracy with FMix data augmentation
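
FMix mixes two training samples using a binary mask obtained by thresholding low-frequency noise sampled in Fourier space. The NumPy sketch below illustrates that idea on 2-D time-frequency features. It is a simplified illustration, not the authors' pipeline: the function names `fmix_mask` and `fmix`, the `decay` and `alpha` defaults, the feature shape, and the soft-label mixing are assumptions made for this example.

```python
import numpy as np

def fmix_mask(shape, lam, decay=3.0, rng=None):
    """Binary mask with roughly a fraction `lam` of ones (hypothetical helper)."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = shape
    # Radial frequency of every 2-D Fourier bin.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    freq = np.sqrt(fx ** 2 + fy ** 2)
    freq[0, 0] = 1.0 / max(h, w)  # avoid division by zero at the DC bin
    # Complex Gaussian noise, attenuated at high frequencies.
    noise = rng.standard_normal((h, w)) + 1j * rng.standard_normal((h, w))
    grey = np.real(np.fft.ifft2(noise / freq ** decay))
    # Set the k largest values of the smooth image to one.
    k = int(round(lam * h * w))
    mask = np.zeros(h * w)
    mask[np.argsort(grey.ravel())[::-1][:k]] = 1.0
    return mask.reshape(h, w)

def fmix(x1, x2, y1, y2, alpha=1.0, rng=None):
    """Mix two feature maps and their label vectors (assumed soft-label variant)."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)           # mixing proportion
    mask = fmix_mask(x1.shape, lam, rng=rng)
    lam_eff = mask.mean()                  # proportion actually taken from x1
    x = mask * x1 + (1.0 - mask) * x2
    y = lam_eff * y1 + (1.0 - lam_eff) * y2
    return x, y, mask
```

Because the mask is built from low-frequency noise, the regions taken from each sample tend to be contiguous blobs rather than scattered bins.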

Abstract

Environmental sound recognition (ESR) is an emerging research topic in audio pattern recognition. Many tasks have been proposed that rely on computational models for ESR in real-life applications. However, current models are usually designed for individual tasks and are neither robust nor applicable to other tasks. Cross-task models, which promote unified knowledge modeling across various tasks, have not been thoroughly investigated. In this article, we propose a cross-task model for three different ESR tasks: 1) acoustic scene classification; 2) urban sound tagging; and 3) anomalous sound detection. We present an architecture named SE-Trans that uses attention-based Squeeze-and-Excitation and Transformer encoder modules to learn the channelwise relationships and temporal dependencies of the acoustic features. FMix is employed as the data augmentation method that improves the…
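
The channelwise attention mentioned in the abstract can be illustrated with a minimal NumPy sketch of a Squeeze-and-Excitation block. This is not the authors' implementation: the function name `squeeze_excite`, the reduction ratio `r = 4`, the tensor shape, and the random weights are assumptions chosen only to show the mechanism.

```python
import numpy as np

def squeeze_excite(x, w1, b1, w2, b2):
    """Rescale each channel of x, shaped (channels, time, freq), by a learned gate."""
    # Squeeze: global average pooling over time and frequency -> (C,)
    z = x.mean(axis=(1, 2))
    # Excitation: bottleneck layer with ReLU -> (C // r,)
    h = np.maximum(0.0, w1 @ z + b1)
    # Sigmoid gates in (0, 1), one per channel -> (C,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))
    # Scale: broadcast the gates over the time and frequency axes
    return x * s[:, None, None], s

# Illustrative sizes and random weights (assumptions, not values from the paper).
rng = np.random.default_rng(0)
C, T, F, r = 8, 10, 16, 4
x = rng.standard_normal((C, T, F))
w1 = rng.standard_normal((C // r, C)) * 0.1
b1 = np.zeros(C // r)
w2 = rng.standard_normal((C, C // r)) * 0.1
b2 = np.zeros(C)

y, gates = squeeze_excite(x, w1, b1, w2, b2)
print(y.shape, gates.shape)
```

The output has the same shape as the input; only the relative weighting of the channels changes. Modeling temporal dependencies is handled separately by the Transformer encoder and is not covered by this sketch.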

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Diverse Musicological Studies

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Softmax · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Dropout · Layer Normalization