Evaluating CNN with Stacked Feature Representations and Audio Spectrogram Transformer Models for Sound Classification

Parinaz Binandeh Dehaghani; Danilo Pena; A. Pedro Aguiar

arXiv:2602.09321 · eess.AS · February 25, 2026

Open Access

TL;DR

This study compares CNNs with stacked acoustic features and transformer models for sound classification, showing CNNs are more efficient with limited data and resources, especially in edge applications.

Contribution

The paper presents a comparative analysis of feature-stacked CNNs and transformer models across two datasets (ESC-50 and UrbanSound8K) and several training regimes, highlighting the relative advantages of each approach.

Findings

01 Feature-stacked CNNs perform well with limited data.

02 Transformer models excel with large-scale pretraining.

03 CNNs are more resource-efficient for edge scenarios.

Abstract

Environmental sound classification (ESC) has gained significant attention due to its diverse applications in smart city monitoring, fault detection, acoustic surveillance, and manufacturing quality control. To enhance CNN performance, feature stacking techniques have been explored to aggregate complementary acoustic descriptors into richer input representations. In this paper, we investigate CNN-based models employing various stacked feature combinations, including Log-Mel Spectrogram (LM), Spectral Contrast (SPC), Chroma (CH), Tonnetz (TZ), Mel-Frequency Cepstral Coefficients (MFCCs), and Gammatone Cepstral Coefficients (GTCC). Experiments are conducted on the widely used ESC-50 and UrbanSound8K datasets under different training regimes, including pretraining on ESC-50, fine-tuning on UrbanSound8K, and comparison with Audio Spectrogram Transformer (AST) models pretrained on large-scale…
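To make the idea of feature stacking concrete, here is a minimal sketch in Python. It is not the authors' pipeline: it assumes each descriptor has already been computed as a (bins, frames) matrix on a shared frame grid, that each is standardized independently, and that stacking means concatenation along the bin axis. The function name `stack_features`, the bin counts, and the random placeholder data are all hypothetical and chosen only for illustration.

```python
import numpy as np


def stack_features(features):
    """Standardize each (n_bins, n_frames) descriptor and stack them along the bin axis.

    Assumption: every descriptor was computed with the same hop length,
    so all matrices share the same number of frames.
    """
    frame_counts = {f.shape[1] for f in features.values()}
    if len(frame_counts) != 1:
        raise ValueError("all descriptors must share the same number of frames")

    normalized = []
    for f in features.values():
        f = f.astype(np.float32)
        # Per-descriptor standardization so no single feature dominates by scale.
        normalized.append((f - f.mean()) / (f.std() + 1e-8))

    stacked = np.concatenate(normalized, axis=0)  # (total_bins, n_frames)
    return stacked[np.newaxis, ...]               # add a channel axis for a 2D CNN


# Placeholder matrices standing in for real descriptors (illustrative sizes only).
rng = np.random.default_rng(0)
n_frames = 216
features = {
    "LM": rng.normal(size=(128, n_frames)),   # Log-Mel Spectrogram
    "SPC": rng.normal(size=(7, n_frames)),    # Spectral Contrast
    "CH": rng.normal(size=(12, n_frames)),    # Chroma
    "TZ": rng.normal(size=(6, n_frames)),     # Tonnetz
}
x = stack_features(features)
print(x.shape)  # (1, 153, 216)
```

In practice the placeholder matrices would be replaced by descriptors extracted from audio clips. Whether the stack is formed along the bin axis, as here, or as separate input channels is a design choice this sketch does not settle.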

Taxonomy

Topics: Music and Audio Processing · Noise Effects and Management · Speech and Audio Processing