SpectroFusion-ViT: A Lightweight Transformer for Speech Emotion Recognition Using Harmonic Mel-Chroma Fusion

Faria Ahmed; Rafi Hassan Chowdhury; Fatema Tuz Zohora Moon; Sabbir Ahmed

arXiv:2603.00746 · cs.SD · March 3, 2026


TL;DR

SpectroFusion-ViT is a lightweight transformer-based framework for speech emotion recognition that reaches 92.56% accuracy on SUBESCO and 82.19% on BanglaSER, two Bangla datasets, with only 2.04M parameters and 0.1 GFLOPs.

Contribution

This work presents a novel lightweight transformer model, SpectroFusion-ViT, combining harmonic Mel-Chroma features for effective speech emotion recognition in resource-constrained environments.

Findings

01. Achieves 92.56% accuracy on the SUBESCO dataset.

02. Achieves 82.19% accuracy on the BanglaSER dataset.

03. The model contains only 2.04M parameters and requires 0.1 GFLOPs.
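
To put the reported 2.04M parameter count in deployment terms, the arithmetic below estimates the storage needed for the weights alone. It is a rough sketch: the numeric formats (float32, float16, int8) are assumptions for illustration, since this page does not state how the weights are stored, and activations and runtime overhead are not included.

```python
# Rough storage estimate for the weights of a 2.04M-parameter model.
# PARAMS comes from the reported figure; the storage formats below are
# illustrative assumptions, not details taken from the paper.
PARAMS = 2.04e6

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_size_mb(num_params: float, bytes_per_param: int) -> float:
    """Return the size of the weights alone in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


sizes = {name: weight_size_mb(PARAMS, width) for name, width in BYTES_PER_PARAM.items()}

for name, size in sizes.items():
    print(f"{name}: {size:.2f} MB")
```

Under these assumptions the weights occupy roughly 8 MB in float32 and about 2 MB in int8, which is consistent with the resource-constrained deployment target described on this page.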

Abstract

Speech is a natural means of conveying emotions, making it an effective method for understanding and representing human feelings. Reliable speech emotion recognition (SER) is central to applications in human-computer interaction, healthcare, education, and customer service. However, most SER methods depend on heavy backbone models or hand-crafted features that fail to balance accuracy and efficiency, particularly for low-resource languages like Bangla. In this work, we present SpectroFusion-ViT, a lightweight SER framework built on EfficientViT-b0, a compact Vision Transformer architecture equipped with self-attention to capture long-range temporal and spectral patterns. The model contains only 2.04M parameters and requires 0.1 GFLOPs, enabling deployment in resource-constrained settings without compromising accuracy. Our pipeline first performs preprocessing and augmentation on…
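
The abstract excerpt above is cut off before it describes how the Mel and chroma features are combined. As a minimal sketch only, the code below assumes that fusion means stacking a Mel-band matrix and a 12-bin chroma matrix along the feature axis so that the result can be treated as a single image-like input. The function names `hz_to_mel`, `pitch_class`, and `fuse` are hypothetical, the matrices hold placeholder values, and a real pipeline would compute both feature maps from audio with a signal-processing library.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to the Mel scale (common HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def pitch_class(freq_hz: float, a4_hz: float = 440.0) -> int:
    """Map a frequency to one of 12 chroma bins (0 = C) via its nearest MIDI note."""
    midi_note = round(69 + 12 * math.log2(freq_hz / a4_hz))
    return midi_note % 12


def fuse(mel, chroma):
    """Stack two [features][frames] matrices along the feature axis.

    Assumption: simple concatenation; the paper's actual fusion may differ.
    """
    if len(mel[0]) != len(chroma[0]):
        raise ValueError("mel and chroma must have the same number of frames")
    return mel + chroma


# Placeholder feature maps: 4 Mel bands and 12 chroma bins over 3 frames.
mel = [[0.0] * 3 for _ in range(4)]
chroma = [[1.0] * 3 for _ in range(12)]

fused = fuse(mel, chroma)
print(len(fused), len(fused[0]))  # feature rows, frames
```

Stacking along the feature axis keeps the time axis shared between the two representations, which is what lets a vision-style backbone attend over both spectral and pitch-class patterns in one input.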


Taxonomy

Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Music and Audio Processing