SpectroFusion-ViT: A Lightweight Transformer for Speech Emotion Recognition Using Harmonic Mel-Chroma Fusion
Faria Ahmed, Rafi Hassan Chowdhury, Fatema Tuz Zohora Moon, Sabbir Ahmed

TL;DR
SpectroFusion-ViT introduces a lightweight, efficient transformer-based framework for speech emotion recognition that achieves high accuracy on Bangla datasets with minimal computational resources.
Contribution
This work presents a novel lightweight transformer model, SpectroFusion-ViT, combining harmonic Mel-Chroma features for effective speech emotion recognition in resource-constrained environments.
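As a rough illustration of what a Mel-Chroma fusion could look like, the sketch below computes a log-mel spectrogram and a 12-bin chromagram from the same power spectrogram and stacks them along the frequency axis. This is not the authors' pipeline: the summary does not say how the two representations are fused, so the stacking step, the function names, and every parameter (`n_fft`, `hop`, `n_mels`, the 16 kHz sample rate) are assumptions chosen for illustration, and the "harmonic" component of the title is not modelled at all.

```python
import numpy as np

def power_spectrogram(signal, n_fft=512, hop=128):
    """Hann-windowed STFT power, shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T ** 2

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters spaced evenly on the mel scale."""
    to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    edges = to_hz(np.linspace(to_mel(0.0), to_mel(sr / 2), n_mels + 2))
    bank = np.zeros((n_mels, freqs.size))
    for m in range(n_mels):
        lo, mid, hi = edges[m:m + 3]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def chroma_filterbank(sr, n_fft, n_chroma=12):
    """Assign each FFT bin (except DC) to its nearest pitch class."""
    freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    bank = np.zeros((n_chroma, freqs.size))
    bins = np.arange(1, freqs.size)  # skip the 0 Hz bin
    midi = 69.0 + 12.0 * np.log2(freqs[bins] / 440.0)
    bank[np.round(midi).astype(int) % n_chroma, bins] = 1.0
    return bank

def mel_chroma_fusion(signal, sr=16000, n_fft=512, hop=128, n_mels=64):
    """Hypothetical fusion: stack log-mel and normalised chroma rows."""
    power = power_spectrogram(signal, n_fft, hop)
    log_mel = np.log(mel_filterbank(sr, n_fft, n_mels) @ power + 1e-10)
    chroma = chroma_filterbank(sr, n_fft) @ power
    chroma = chroma / (chroma.max(axis=0, keepdims=True) + 1e-10)
    return np.concatenate([log_mel, chroma], axis=0)

# Usage: one second of a 440 Hz tone at 16 kHz.
sr = 16000
tone = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)
fused = mel_chroma_fusion(tone, sr=sr)
print(fused.shape)
```

The result is a single 2-D array of `n_mels + 12` rows that can be treated as an image and passed to a vision backbone. Other fusion choices, such as separate channels or learned weighting, are equally plausible given what the summary states.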
Findings
Achieves 92.56% accuracy on the SUBESCO dataset.
Achieves 82.19% accuracy on the BanglaSER dataset.
The model contains only 2.04M parameters and requires 0.1 GFLOPs.
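To put the parameter count in deployment terms, the short calculation below converts 2.04M parameters into an approximate weight-storage size at a few common numeric precisions. The helper name `weight_megabytes` and the choice of precisions are illustrative assumptions. The figures cover weights only (no activations, optimizer state, or file-format overhead) and are not reported in the paper.

```python
def weight_megabytes(n_params, bytes_per_param):
    """Approximate storage for model weights, in megabytes (10^6 bytes)."""
    return n_params * bytes_per_param / 1e6

n_params = 2.04e6  # parameter count stated in the findings above

# Precisions chosen for illustration; the summary does not state which is used.
for name, width in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{name}: {weight_megabytes(n_params, width):.2f} MB")
```

At 32-bit precision the weights come to roughly 8 MB, which is consistent with the claim that the model suits resource-constrained devices.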
Abstract
Speech is a natural means of conveying emotions, making it an effective channel for understanding and representing human feelings. Reliable speech emotion recognition (SER) is central to applications in human-computer interaction, healthcare, education, and customer service. However, most SER methods depend on heavy backbone models or hand-crafted features that fail to balance accuracy and efficiency, particularly for low-resource languages like Bangla. In this work, we present SpectroFusion-ViT, a lightweight SER framework built on EfficientViT-b0, a compact Vision Transformer architecture equipped with self-attention to capture long-range temporal and spectral patterns. The model contains only 2.04M parameters and requires 0.1 GFLOPs, enabling deployment in resource-constrained settings without compromising accuracy. Our pipeline first performs preprocessing and augmentation on…
Taxonomy
Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Music and Audio Processing
