Are Mamba-based Audio Foundation Models the Best Fit for Non-Verbal Emotion Recognition?
Mohd Mujtaba Akhtar, Orchid Chetia Phukan, Girish, Swarup Ranjan Behera, Ananda Chandra Nayak, Sanjib Kumar Nayak, Arun Balaji Buduru, Rajesh Sharma

TL;DR
This paper evaluates mamba-based audio foundation models for non-verbal emotion recognition, demonstrating their superiority over attention-based models and proposing a fusion method that achieves state-of-the-art results.
Contribution
It presents the first application of mamba-based audio foundation models (MAFMs) to non-verbal emotion recognition (NVER) and proposes RENO, a fusion approach with a novel loss function, to improve emotion recognition performance.
Findings
MAFMs outperform attention-based audio foundation models (AAFMs) on NVER.
Fusion of MAFMs and AAFMs with RENO achieves state-of-the-art results.
The proposed RENO method improves model alignment and performance.
Abstract
In this work, we focus on emotion recognition from non-verbal vocal sounds (NVER). We investigate mamba-based audio foundation models (MAFMs) for NVER for the first time and hypothesize that MAFMs will outperform attention-based audio foundation models (AAFMs) on this task by leveraging their state-space modeling to capture intrinsic emotional structures more effectively. Unlike AAFMs, which may amplify irrelevant patterns due to their attention mechanisms, MAFMs are expected to extract more stable and context-aware representations, enabling better differentiation of subtle non-verbal emotional cues. Our experiments with state-of-the-art (SOTA) AAFMs and MAFMs validate our hypothesis. Further, motivated by related research areas such as speech emotion recognition and synthetic speech detection, where fusion of foundation models (FMs) has shown improved performance, we also explore fusion of FMs for NVER. To this…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Emotion and Mood Recognition
Methods: Softmax · Attention Is All You Need · Focus
