Explainable Transformer-CNN Fusion for Noise-Robust Speech Emotion Recognition
Sudip Chakrabarty, Pappu Bishwas, Rajdeep Chatterjee

TL;DR
This paper introduces a hybrid Transformer-CNN model for noise-robust speech emotion recognition that combines contextual and spectral features, achieving superior accuracy and interpretability in noisy environments.
Contribution
The paper presents a novel dual-stream Transformer-CNN framework that unifies long-range temporal modeling with noise-resistant spectral feature extraction for SER.
Findings
Outperforms baseline models on four benchmark datasets: RAVDESS, TESS, SAVEE, and CREMA-D.
Maintains high accuracy under real-world acoustic interference.
Provides visual explanations of model decision-making using SHAP and Score-CAM.
Abstract
Speech Emotion Recognition (SER) systems often degrade in performance when exposed to the unpredictable acoustic interference found in real-world environments. Additionally, the opacity of deep learning models hinders their adoption in trust-sensitive applications. To bridge this gap, we propose a Hybrid Transformer-CNN framework that unifies the contextual modeling of Wav2Vec 2.0 with the spectral stability of 1D-Convolutional Neural Networks. Our dual-stream architecture processes raw waveforms to capture long-range temporal dependencies while simultaneously extracting noise-resistant spectral features (MFCC, ZCR, RMSE) via a custom Attentive Temporal Pooling mechanism. We conducted extensive validation across four diverse benchmark datasets: RAVDESS, TESS, SAVEE, and CREMA-D. To rigorously test robustness, we subjected the model to non-stationary acoustic interference using…
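The spectral stream described in the abstract can be illustrated with a minimal sketch. It computes two of the named frame-level features (ZCR and RMSE, read here as root-mean-square energy) and collapses them over time with a generic additive-attention pooling layer. This is not the authors' Attentive Temporal Pooling: the frame length, hop size, hidden width, parameter names (`w`, `b`, `v`), and the random, untrained weights are all assumptions made for illustration, and MFCCs and the Wav2Vec 2.0 stream are omitted.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Slice a 1-D waveform into overlapping frames of shape (T, frame_len)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def zcr_and_rms(frames):
    """Per-frame zero-crossing rate and root-mean-square energy, shape (T, 2)."""
    signs = np.signbit(frames)
    zcr = np.mean(signs[:, 1:] != signs[:, :-1], axis=1)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return np.stack([zcr, rms], axis=1)

def attentive_temporal_pooling(feats, w, b, v):
    """Generic additive-attention pooling (assumed form, not the paper's).

    feats: (T, D) frame features; w: (D, H); b: (H,); v: (H,).
    Returns the attention-weighted average over time and the weights.
    """
    hidden = np.tanh(feats @ w + b)        # (T, H)
    scores = hidden @ v                    # (T,) one score per frame
    scores = scores - scores.max()         # stabilise the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    pooled = weights @ feats               # (D,)
    return pooled, weights

# Synthetic 1-second "utterance" at 16 kHz: a tone plus white noise.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
wave = np.sin(2 * np.pi * 220.0 * t) + 0.1 * rng.standard_normal(t.size)

feats = zcr_and_rms(frame_signal(wave))    # (T, 2)

# Untrained, randomly initialised attention parameters (illustrative only).
D, H = feats.shape[1], 8
w = 0.1 * rng.standard_normal((D, H))
b = np.zeros(H)
v = rng.standard_normal(H)

pooled, weights = attentive_temporal_pooling(feats, w, b, v)
print(feats.shape, pooled.shape)
```

In a trained model the attention parameters would be learned, so frames carrying more emotional cues could receive larger weights than silent or noise-dominated frames. With all-zero scores the layer reduces to plain mean pooling.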
Taxonomy
Topics: Emotion and Mood Recognition · Music and Audio Processing · Speech and Audio Processing
