Improving Acoustic Scene Classification in Low-Resource Conditions
Zhi Chen, Yun-Fei Shao, Yong Ma, Mingsheng Wei, Le Zhang, Wei-Qiang Zhang

TL;DR
This paper introduces DS-FlexiNet, a novel efficient model for acoustic scene classification. It combines depthwise separable convolutions and residual connections with model compression, data augmentation, and knowledge distillation to improve performance in low-resource, heterogeneous-device environments.
Contribution
The paper presents DS-FlexiNet, a new model architecture that integrates depthwise separable convolutions, residual connections, and domain-specific normalization for low-resource acoustic scene classification.
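The convolutional building block described above can be illustrated with a minimal NumPy sketch. It is not the authors' DS-FlexiNet code. It assumes a single unbatched (channels, frequency, time) feature map, omits normalization and the 1x1 expansion layer of MobileNetV2's inverted residual block, and all function names are hypothetical. A depthwise 3x3 filter acts on each channel separately, a pointwise (1x1) convolution then mixes channels, and an identity shortcut adds the input back in the ResNet style.

```python
import numpy as np


def depthwise_conv2d(x, kernels):
    """One k x k filter per channel, stride 1, zero 'same' padding.

    x: (C, F, T) feature map; kernels: (C, k, k) with odd k.
    Like deep-learning conv layers, this computes a cross-correlation.
    """
    c, f, t = x.shape
    k = kernels.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((c, f, t))
    for i in range(k):
        for j in range(k):
            # Each channel is weighted only by its own kernel tap: no channel mixing.
            out += kernels[:, i, j][:, None, None] * xp[:, i:i + f, j:j + t]
    return out


def pointwise_conv2d(x, weights):
    """1x1 convolution: mixes channels at every (F, T) position. weights: (C_out, C_in)."""
    return np.einsum("oc,cft->oft", weights, x)


def ds_residual_block(x, dw_kernels, pw_weights):
    """Depthwise -> ReLU -> pointwise, plus an identity shortcut (needs C_out == C_in)."""
    y = np.maximum(depthwise_conv2d(x, dw_kernels), 0.0)
    y = pointwise_conv2d(y, pw_weights)
    return x + y


def conv_param_counts(c_in, c_out, k):
    """Weights (biases ignored) of a standard k x k conv vs. its depthwise separable version."""
    standard = k * k * c_in * c_out
    separable = k * k * c_in + c_in * c_out
    return standard, separable


rng = np.random.default_rng(0)
x = rng.normal(size=(64, 32, 50))            # 64 channels, 32 frequency bins, 50 frames
dw = rng.normal(scale=0.1, size=(64, 3, 3))
pw = rng.normal(scale=0.1, size=(64, 64))
y = ds_residual_block(x, dw, pw)
print(y.shape)                               # (64, 32, 50)
print(conv_param_counts(64, 64, 3))          # (36864, 4672)
```

With 64 input and output channels and a 3x3 kernel, the factorized layer needs 4,672 weights instead of 36,864, about 7.9 times fewer. This is the kind of saving behind the reduced computational overhead the abstract attributes to depthwise separable convolutions.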
Findings
DS-FlexiNet outperforms existing models in low-resource settings.
Quantization Aware Training reduces model size with minimal accuracy loss.
Knowledge Distillation improves cross-device generalization.
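The distillation finding can be made concrete with the widely used temperature-scaled distillation loss introduced by Hinton et al.: the student matches the softened predictions of its teachers while also fitting the hard labels. How the paper combines its twelve teachers is not stated here. This sketch assumes the teachers' logits are simply averaged, and the temperature, the weight `alpha`, the ten-class setup, and all function names are illustrative assumptions.

```python
import numpy as np


def log_softmax(logits, temperature=1.0):
    """Row-wise log-softmax of logits / temperature, computed stably."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def ensemble_kd_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """alpha * CE(labels, student) + (1 - alpha) * T^2 * KL(teacher_T || student_T).

    student_logits: (N, K); teacher_logits: list of (N, K) arrays, one per teacher;
    labels: (N,) integer class indices.
    """
    # Assumption: the teacher ensemble is formed by averaging logits.
    ensemble = np.mean(np.stack(teacher_logits), axis=0)
    log_p_teacher = log_softmax(ensemble, temperature)
    log_p_student_t = log_softmax(student_logits, temperature)
    # KL divergence per example, averaged over the batch
    kl = np.sum(np.exp(log_p_teacher) * (log_p_teacher - log_p_student_t), axis=-1).mean()
    # Cross-entropy with the hard scene labels at temperature 1
    log_p_student = log_softmax(student_logits)
    ce = -log_p_student[np.arange(len(labels)), labels].mean()
    # T^2 keeps the soft-target gradient scale comparable across temperatures
    return alpha * ce + (1.0 - alpha) * temperature ** 2 * kl


rng = np.random.default_rng(0)
num_classes = 10                                                    # assumed number of scenes
teachers = [rng.normal(size=(8, num_classes)) for _ in range(12)]   # twelve teacher outputs
student = rng.normal(size=(8, num_classes))
labels = rng.integers(0, num_classes, size=8)
print(ensemble_kd_loss(student, teachers, labels))
```

Setting `alpha` to 1 recovers plain cross-entropy training. Lowering it shifts weight toward matching the ensemble's soft predictions, which encode which scenes the teachers find similar.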
Abstract
Acoustic Scene Classification (ASC) identifies an environment based on an audio signal. This paper explores ASC in low-resource conditions and proposes a novel model, DS-FlexiNet, which combines depthwise separable convolutions from MobileNetV2 with ResNet-inspired residual connections for a balance of efficiency and accuracy. To address hardware limitations and device heterogeneity, DS-FlexiNet employs Quantization Aware Training (QAT) for model compression and data augmentation methods like Auto Device Impulse Response (ADIR) and Freq-MixStyle (FMS) to improve cross-device generalization. Knowledge Distillation (KD) from twelve teacher models further enhances performance on unseen devices. The architecture includes a custom Residual Normalization layer to handle domain differences across devices, and depthwise separable convolutions reduce computational overhead without sacrificing…
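Freq-MixStyle, one of the augmentations named in the abstract, perturbs the frequency-wise statistics of spectrograms so the model relies less on the spectral coloration a particular recording device imposes. The sketch below assumes the formulation used in earlier DCASE systems. Each frequency bin is normalized by its mean and standard deviation over channels and time, then re-scaled with statistics linearly mixed with those of another example in the batch. The mixing weight is Beta-distributed, and the whole operation is applied with probability `p`. The hyperparameter values, the (batch, channels, frequency, time) layout, and the function name are assumptions, not settings reported for DS-FlexiNet.

```python
import numpy as np


def freq_mixstyle(x, rng, p=0.4, alpha=0.3, eps=1e-6):
    """Mix per-frequency statistics across a batch.

    x: (B, C, F, T) batch of spectrograms. Returns an array of the same shape.
    """
    if rng.random() >= p:          # skip the augmentation with probability 1 - p
        return x
    b = x.shape[0]
    # Statistics per example and per frequency bin, pooled over channels and time
    mu = x.mean(axis=(1, 3), keepdims=True)                  # (B, 1, F, 1)
    sig = np.sqrt(x.var(axis=(1, 3), keepdims=True) + eps)   # (B, 1, F, 1)
    x_normed = (x - mu) / sig
    # One mixing weight per example and a random partner example for each one
    lam = rng.beta(alpha, alpha, size=(b, 1, 1, 1))
    perm = rng.permutation(b)
    mu_mix = lam * mu + (1 - lam) * mu[perm]
    sig_mix = lam * sig + (1 - lam) * sig[perm]
    return x_normed * sig_mix + mu_mix


rng = np.random.default_rng(0)
batch = rng.normal(size=(8, 1, 256, 100))   # 8 spectrograms, 256 bins, 100 frames (illustrative)
augmented = freq_mixstyle(batch, rng, p=1.0)
print(augmented.shape)                      # (8, 1, 256, 100)
```

The scene labels are left untouched because only per-bin statistics are mixed. With a single-example batch the partner is the example itself, so the input comes back unchanged up to rounding.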
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing
Methods: Depthwise Convolution · Pointwise Convolution · Depthwise Separable Convolution · Batch Normalization · 1x1 Convolution · Convolution · Inverted Residual Block · Knowledge Distillation · Average Pooling
