# A lightweight dual branch masking network for environmental sound classification

**Authors:** Guorong Chen, Bao Zhang, Zhikang Ding, Ke Xiao, Pengyu Guan, Xianghan Xiao, Xiaoqiang Wang, Haixin Yi, Hong Hu, Weijie Zhang

PMC · DOI: 10.1038/s41598-025-33636-w · Scientific Reports · 2025-12-31

## TL;DR

The paper introduces SpectroMaskNet, a compact model for environmental sound classification that reaches 95.50% to 97.50% accuracy across four benchmark datasets at low computational cost, making it suitable for resource-constrained deployment.

## Contribution

Proposes SpectroMaskNet, a dual-branch architecture with global-local attention and block-masked spectrogram augmentation for efficient sound classification.
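
To make the phrase "global-local attention" concrete, here is a minimal NumPy sketch of one way a global branch and a local branch can be combined. It is an illustration under stated assumptions, not the authors' architecture: the identity projections (no learned weights), the window size, the concatenation step, and the names `attention` and `global_local` are all hypothetical and do not come from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, mask=None):
    """Scaled dot-product self-attention over the rows of x (frames x features).

    Assumption: identity query/key/value projections, to keep the sketch short.
    """
    scores = x @ x.T / np.sqrt(x.shape[-1])
    if mask is not None:
        # Positions outside the mask get zero attention weight.
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores, axis=-1) @ x

def global_local(x, window=4):
    """Concatenate a global branch (every frame attends to every frame)
    with a local branch (each frame attends only to frames within `window`)."""
    idx = np.arange(x.shape[0])
    local_mask = np.abs(idx[:, None] - idx[None, :]) <= window
    global_out = attention(x)
    local_out = attention(x, local_mask)
    return np.concatenate([global_out, local_out], axis=-1)

# Usage: 20 time frames with 8 features each.
frames = np.random.default_rng(0).normal(size=(20, 8))
fused = global_local(frames, window=4)
```

The global branch can relate distant frames, while the windowed branch is restricted to nearby context; a trained model would add learned projections and a classifier on top of the fused features.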

## Key findings

- SpectroMaskNet achieves 97.50% accuracy on ESC-10, outperforming existing lightweight models.
- Without large-scale pretraining, it also reaches 95.50% on ESC-50, 96.32% on UrbanSound8K, and 96.52% on SpeechCommandV2 while keeping computational complexity low.

## Abstract

Environmental sound classification (ESC) is crucial for applications such as intelligent surveillance, urban acoustic monitoring, and human-computer interaction. Although deep neural networks (DNNs) have significantly improved ESC performance, these methods often rely on large models and extensive pretraining, making them difficult to deploy in resource-constrained environments. Some existing lightweight models, while having fewer parameters, still suffer from limited representational capacity, leading to suboptimal generalization, especially in low-data scenarios. To address these challenges, we propose SpectroMaskNet, a compact dual-branch architecture. This design integrates global-local attention mechanisms with block-masked spectrogram augmentation, allowing the model to capture both long-term temporal dependencies and fine-grained spectral features. This enhances robustness and generalization, particularly in data-scarce situations. Experimental results on four benchmark datasets (ESC-10, ESC-50, UrbanSound8K, and SpeechCommandV2) demonstrate that SpectroMaskNet achieves accuracies of 97.50%, 95.50%, 96.32%, and 96.52%, respectively, outperforming existing lightweight baselines without requiring large-scale pretraining. Furthermore, the model maintains low computational complexity, making it well-suited for real-world ESC applications that demand efficiency and scalability.
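
As a rough illustration of what block-masked spectrogram augmentation can look like, the sketch below zeroes out a few random rectangular time-frequency blocks of a spectrogram. The abstract does not give the masking details, so the block count, block size limits, fill value, and the function name `block_mask` are assumptions made for this example rather than the paper's settings.

```python
import numpy as np

def block_mask(spec, num_blocks=2, max_freq=8, max_time=16, fill=0.0, rng=None):
    """Return a copy of `spec` (freq_bins x time_frames) with random
    rectangular blocks overwritten by `fill`.

    All default values are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = spec.copy()  # leave the caller's array untouched
    n_freq, n_time = out.shape
    for _ in range(num_blocks):
        # Sample the block height and width (at least 1, at most the limit).
        h = int(rng.integers(1, min(max_freq, n_freq) + 1))
        w = int(rng.integers(1, min(max_time, n_time) + 1))
        # Sample the block's top-left corner so that it fits inside the array.
        f0 = int(rng.integers(0, n_freq - h + 1))
        t0 = int(rng.integers(0, n_time - w + 1))
        out[f0:f0 + h, t0:t0 + w] = fill
    return out

# Usage: mask a dummy 64-bin x 128-frame spectrogram.
spec = np.ones((64, 128))
masked = block_mask(spec, rng=np.random.default_rng(0))
```

Applying a fresh random mask to each training example hides different time-frequency regions every time, which is the general idea behind using masking as a regularizer in low-data settings.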

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12847998/full.md

## Figures

11 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12847998/full.md

## References

24 references — full list in the complete paper: https://tomesphere.com/paper/PMC12847998/full.md

---
Source: https://tomesphere.com/paper/PMC12847998