Strong Alone, Stronger Together: Synergizing Modality-Binding Foundation Models with Optimal Transport for Non-Verbal Emotion Recognition
Orchid Chetia Phukan, Mohd Mujtaba Akhtar, Girish, Swarup Ranjan Behera, Sishir Kalita, Arun Balaji Buduru, Rajesh Sharma, S. R. Mahadeva Prasanna

TL;DR
This paper demonstrates that combining multimodal foundation models with optimal transport techniques significantly improves non-verbal emotion recognition accuracy, surpassing previous state-of-the-art methods.
Contribution
The study introduces a novel framework called MATA that effectively combines multimodal foundation models using optimal transport, enhancing emotion recognition from non-verbal sounds.
Findings
MATA achieves top performance on benchmark datasets.
Combining MFMs outperforms individual foundation models and baseline fusion techniques.
Proposed method sets new state-of-the-art results.
Abstract
In this study, we investigate multimodal foundation models (MFMs) for emotion recognition from non-verbal sounds. We hypothesize that MFMs, with their joint pre-training across multiple modalities, will be more effective in non-verbal sounds emotion recognition (NVER) by better interpreting and differentiating subtle emotional cues that may be ambiguous in audio-only foundation models (AFMs). To validate our hypothesis, we extract representations from state-of-the-art (SOTA) MFMs and AFMs and evaluate them on benchmark NVER datasets. We also investigate the potential of combining selected foundation model representations to enhance NVER further, inspired by research in speech recognition and audio deepfake detection. To achieve this, we propose a framework called MATA (Intra-Modality Alignment through Transport Attention). Through MATA coupled with the combination of MFMs: LanguageBind…
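The abstract does not spell out how MATA applies optimal transport, so the following is only a generic sketch of the underlying idea: aligning the representations of one foundation model to those of another with an entropic (Sinkhorn) transport plan before fusing them. Everything here is an assumption made for illustration, not the authors' implementation: the function names `sinkhorn_plan` and `transport_align`, the random stand-in features, the uniform marginals, the regularization value, and the premise that both models' features have already been projected to the same dimension.

```python
import numpy as np

def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropic-regularized transport plan between two uniform distributions."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)      # uniform source marginal (assumption)
    b = np.full(m, 1.0 / m)      # uniform target marginal (assumption)
    K = np.exp(-cost / reg)      # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):     # alternating Sinkhorn scaling updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def transport_align(x, y, reg=0.1):
    """Move features x (n, d) toward y (m, d) by barycentric projection."""
    # Pairwise squared Euclidean cost, scaled to [0, 1] for numerical stability.
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    cost = cost / cost.max()
    plan = sinkhorn_plan(cost, reg)
    # Row-normalize so each aligned vector is a convex combination of rows of y.
    weights = plan / plan.sum(axis=1, keepdims=True)
    return weights @ y

# Random stand-ins for features from two foundation models (hypothetical data).
rng = np.random.default_rng(0)
feats_a = rng.normal(size=(8, 16))
feats_b = rng.normal(size=(8, 16))

aligned_a = transport_align(feats_a, feats_b)
fused = np.concatenate([aligned_a, feats_b], axis=-1)  # simple concatenation fusion
```

Here `fused` would be the input to a downstream emotion classifier. The concatenation step is a placeholder for whatever fusion the framework actually uses.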
Taxonomy
Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis
