Improving Acoustic Scene Classification in Low-Resource Conditions

Zhi Chen; Yun-Fei Shao; Yong Ma; Mingsheng Wei; Le Zhang; Wei-Qiang Zhang

arXiv:2412.20722 · eess.AS · April 29, 2025

Open Access

TL;DR

This paper introduces DS-FlexiNet, a novel efficient model for acoustic scene classification that combines advanced convolutional techniques, model compression, data augmentation, and knowledge distillation to improve performance in low-resource and heterogeneous device environments.

Contribution

The paper presents DS-FlexiNet, a new model architecture that integrates depthwise separable convolutions, residual connections, and domain-specific normalization for low-resource acoustic scene classification.
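As a rough illustration of why depthwise separable convolutions suit low-resource models, the sketch below compares the weight count of a standard convolution with that of a depthwise convolution followed by a 1x1 pointwise convolution. The layer sizes are hypothetical and are not taken from the paper; biases are ignored.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
c_in, c_out, k = 64, 128, 3
std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
print(std, sep, round(std / sep, 2))  # 73728 8768 8.41
```

For these assumed sizes the separable layer needs roughly one eighth of the weights; the saving grows with the kernel size and the number of output channels.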

Findings

01 DS-FlexiNet outperforms existing models in low-resource settings.

02 Quantization Aware Training reduces model size with minimal accuracy loss.

03 Knowledge Distillation improves cross-device generalization.
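To make the Quantization Aware Training finding more concrete, here is a minimal sketch of the "fake quantization" step that QAT typically applies in the forward pass: values are rounded to an integer grid and mapped back to floats, so the network trains against the rounding error it will see after compression. The symmetric per-tensor int8 scheme and the scale value are assumptions for illustration; this page does not state the bit width or scheme the authors used.

```python
def fake_quantize(x: float, scale: float, num_bits: int = 8) -> float:
    """Quantize x to a signed integer grid, then dequantize back to float.

    Assumes symmetric per-tensor quantization (an illustrative choice,
    not confirmed by the source).
    """
    qmin = -(2 ** (num_bits - 1))       # -128 for 8 bits
    qmax = 2 ** (num_bits - 1) - 1      # 127 for 8 bits
    q = round(x / scale)                # snap to the nearest integer level
    q = max(qmin, min(qmax, q))         # clamp to the representable range
    return q * scale                    # map back to the float domain


# Hypothetical weights and scale, chosen only for illustration.
weights = [0.503, -1.27, 0.0049, 2.0]
scale = 0.01
quantized = [fake_quantize(w, scale) for w in weights]
print(quantized)
```

In this example the last weight, 2.0, lies outside the representable range and is clamped to the largest level. During training, frameworks usually pass gradients through the rounding step unchanged (the straight-through estimator), which this forward-only sketch does not show.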

Abstract

Acoustic Scene Classification (ASC) identifies an environment based on an audio signal. This paper explores ASC in low-resource conditions and proposes a novel model, DS-FlexiNet, which combines depthwise separable convolutions from MobileNetV2 with ResNet-inspired residual connections for a balance of efficiency and accuracy. To address hardware limitations and device heterogeneity, DS-FlexiNet employs Quantization Aware Training (QAT) for model compression and data augmentation methods like Auto Device Impulse Response (ADIR) and Freq-MixStyle (FMS) to improve cross-device generalization. Knowledge Distillation (KD) from twelve teacher models further enhances performance on unseen devices. The architecture includes a custom Residual Normalization layer to handle domain differences across devices, and depthwise separable convolutions reduce computational overhead without sacrificing…
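The abstract mentions distilling from twelve teacher models but, as shown here, does not say how their outputs are combined. The sketch below assumes one common choice: average the teachers' temperature-softened probabilities and penalize the student with a KL divergence against that average. The function names, the averaging rule, and the temperature of 2.0 are all assumptions for illustration, not the authors' confirmed method.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_soft_targets(teacher_logits, temperature):
    """Average the softened class probabilities of several teachers (assumed rule)."""
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    n = len(probs)
    return [sum(p[i] for p in probs) / n for i in range(len(probs[0]))]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher ensemble || student), scaled by T^2 as is conventional."""
    target = ensemble_soft_targets(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(target, student) if t > 0)
    return temperature ** 2 * kl


# Hypothetical logits for two teachers and one student over three classes.
teachers = [[2.0, 0.5, -1.0], [1.5, 1.0, -0.5]]
student = [0.2, 0.1, 0.0]
print(kd_loss(student, teachers))
```

In practice this term is usually added to the ordinary cross-entropy loss on the ground-truth labels, with a weighting factor chosen by validation.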

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing

Methods: Depthwise Convolution · Pointwise Convolution · Depthwise Separable Convolution · Batch Normalization · Attentive Walk-Aggregating Graph Neural Network · 1x1 Convolution · Convolution · Inverted Residual Block · Knowledge Distillation · Average Pooling