Evaluating CNN with Stacked Feature Representations and Audio Spectrogram Transformer Models for Sound Classification
Parinaz Binandeh Dehaghani, Danilo Pena, A. Pedro Aguiar

TL;DR
This study compares CNNs with stacked acoustic features and transformer models for sound classification, showing CNNs are more efficient with limited data and resources, especially in edge applications.
Contribution
The paper presents a comparative analysis of feature-stacked CNNs and transformer models on the ESC-50 and UrbanSound8K datasets under several training regimes, highlighting their relative advantages.
Findings
Feature-stacked CNNs perform well with limited data.
Transformer models excel with large-scale pretraining.
CNNs are more resource-efficient for edge scenarios.
Abstract
Environmental sound classification (ESC) has gained significant attention due to its diverse applications in smart city monitoring, fault detection, acoustic surveillance, and manufacturing quality control. To enhance CNN performance, feature stacking techniques have been explored to aggregate complementary acoustic descriptors into richer input representations. In this paper, we investigate CNN-based models employing various stacked feature combinations, including Log-Mel Spectrogram (LM), Spectral Contrast (SPC), Chroma (CH), Tonnetz (TZ), Mel-Frequency Cepstral Coefficients (MFCCs), and Gammatone Cepstral Coefficients (GTCC). Experiments are conducted on the widely used ESC-50 and UrbanSound8K datasets under different training regimes, including pretraining on ESC-50, fine-tuning on UrbanSound8K, and comparison with Audio Spectrogram Transformer (AST) models pretrained on large-scale…
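The feature stacking idea described in the abstract can be illustrated with a minimal sketch. It assumes each descriptor is already available as a 2D array of shape (bins, frames) with a shared frame count, standardizes each one, and concatenates them along the feature axis to form a single-channel CNN input. The function name `stack_features`, the per-feature standardization, the concatenation axis, and the bin counts below are illustrative assumptions, not details taken from the paper; random arrays stand in for real acoustic features.

```python
import numpy as np

def stack_features(features, eps=1e-8):
    """Standardize each 2D feature map and concatenate along the feature axis.

    `features` maps a descriptor name to an array of shape (bins, frames).
    Returns an array of shape (1, total_bins, frames). Hypothetical helper.
    """
    arrays = {k: np.asarray(v, dtype=np.float32) for k, v in features.items()}
    frame_counts = {a.shape[1] for a in arrays.values()}
    if len(frame_counts) != 1:
        raise ValueError("all feature maps must share the same number of frames")
    normalized = []
    for feat in arrays.values():
        # Zero-mean, unit-variance scaling so no descriptor dominates by magnitude.
        normalized.append((feat - feat.mean()) / (feat.std() + eps))
    stacked = np.concatenate(normalized, axis=0)
    return stacked[np.newaxis, ...]  # add a channel dimension

# Stand-in data: bin counts are assumed typical values, not the paper's settings.
rng = np.random.default_rng(0)
n_frames = 216
dummy = {
    "LM": rng.normal(size=(128, n_frames)),   # log-mel spectrogram
    "SPC": rng.normal(size=(7, n_frames)),    # spectral contrast
    "CH": rng.normal(size=(12, n_frames)),    # chroma
    "TZ": rng.normal(size=(6, n_frames)),     # tonnetz
}
x = stack_features(dummy)
print(x.shape)  # (1, 153, 216)
```

In practice the stand-in arrays would be replaced by descriptors computed from audio clips, and other layouts (for example, resizing each descriptor to a common height and stacking as separate channels) are equally plausible readings of "stacked" input.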
Taxonomy
Topics: Music and Audio Processing · Noise Effects and Management · Speech and Audio Processing
