Explainable Transformer-CNN Fusion for Noise-Robust Speech Emotion Recognition

Sudip Chakrabarty; Pappu Bishwas; Rajdeep Chatterjee

arXiv:2512.18298·cs.SD·December 23, 2025

Open Access

TL;DR

This paper introduces a hybrid Transformer-CNN model for noise-robust speech emotion recognition that combines contextual features from Wav2Vec 2.0 with spectral features (MFCC, ZCR, RMSE), reporting higher accuracy than baseline models and improved interpretability in noisy environments.

Contribution

The paper presents a novel dual-stream Transformer-CNN framework that unifies long-range temporal modeling with noise-resistant spectral feature extraction for SER.

Findings

01 Outperforms baseline models on four benchmark datasets: RAVDESS, TESS, SAVEE, and CREMA-D.

02 Maintains high accuracy under real-world acoustic interference.

03 Provides visual explanations of model decision-making using SHAP and Score-CAM.

Abstract

Speech Emotion Recognition (SER) systems often degrade in performance when exposed to the unpredictable acoustic interference found in real-world environments. Additionally, the opacity of deep learning models hinders their adoption in trust-sensitive applications. To bridge this gap, we propose a Hybrid Transformer-CNN framework that unifies the contextual modeling of Wav2Vec 2.0 with the spectral stability of 1D-Convolutional Neural Networks. Our dual-stream architecture processes raw waveforms to capture long-range temporal dependencies while simultaneously extracting noise-resistant spectral features (MFCC, ZCR, RMSE) via a custom Attentive Temporal Pooling mechanism. We conducted extensive validation across four diverse benchmark datasets: RAVDESS, TESS, SAVEE, and CREMA-D. To rigorously test robustness, we subjected the model to non-stationary acoustic interference using…
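
To make the spectral stream described above more concrete, the following is a minimal NumPy sketch of two of the named frame-level features (ZCR and RMSE) followed by a generic softmax attention pooling over time. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the Attentive Temporal Pooling mechanism, so `attentive_pool` is a standard attention-weighted mean, and the function names, the 400-sample frame length, the 160-sample hop, and the scoring vector `w` are all hypothetical. MFCC extraction and the Wav2Vec 2.0 stream are omitted.

```python
import numpy as np


def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D waveform into overlapping frames of shape (n_frames, frame_len).

    frame_len and hop are assumed values (25 ms / 10 ms at 16 kHz).
    """
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose sign differs, per frame."""
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)


def rms_energy(frames):
    """Root-mean-square energy per frame."""
    return np.sqrt(np.mean(frames ** 2, axis=1))


def attentive_pool(features, w):
    """Generic attention pooling over time (assumed form, not from the paper).

    features: (T, D) frame-level features; w: (D,) scoring vector that would
    be learned in a real model. Returns the pooled (D,) vector and the
    (T,) attention weights.
    """
    scores = features @ w
    scores = scores - scores.max()  # subtract max for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ features, alpha


# Example on a synthetic 1-second, 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440.0 * t)
frames = frame_signal(x)
feats = np.stack([zero_crossing_rate(frames), rms_energy(frames)], axis=1)
pooled, alpha = attentive_pool(feats, np.zeros(feats.shape[1]))
```

With an all-zero scoring vector the attention weights are uniform, so the pooled vector reduces to the plain temporal mean; a trained `w` would instead emphasize the frames the model finds most informative.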

Taxonomy

Topics: Emotion and Mood Recognition · Music and Audio Processing · Speech and Audio Processing