Strong Alone, Stronger Together: Synergizing Modality-Binding Foundation Models with Optimal Transport for Non-Verbal Emotion Recognition

Orchid Chetia Phukan; Mohd Mujtaba Akhtar; Girish; Swarup Ranjan Behera; Sishir Kalita; Arun Balaji Buduru; Rajesh Sharma; S.R Mahadeva Prasanna

arXiv:2409.14221·eess.AS·September 24, 2024

Open Access

TL;DR

This paper demonstrates that combining multimodal foundation models with optimal transport techniques significantly improves non-verbal emotion recognition accuracy, surpassing previous state-of-the-art methods.

Contribution

The study introduces a novel framework called MATA that effectively combines multimodal foundation models using optimal transport, enhancing emotion recognition from non-verbal sounds.

Findings

01 MATA achieves top performance on benchmark datasets.

02 Combining MFMs outperforms individual models and baseline fusion.

03 Proposed method sets new state-of-the-art results.

Abstract

In this study, we investigate multimodal foundation models (MFMs) for emotion recognition from non-verbal sounds. We hypothesize that MFMs, with their joint pre-training across multiple modalities, will be more effective in non-verbal sounds emotion recognition (NVER) by better interpreting and differentiating subtle emotional cues that may be ambiguous in audio-only foundation models (AFMs). To validate our hypothesis, we extract representations from state-of-the-art (SOTA) MFMs and AFMs and evaluate them on benchmark NVER datasets. We also investigate the potential of combining selected foundation model representations to enhance NVER further, inspired by research in speech recognition and audio deepfake detection. To achieve this, we propose a framework called MATA (Intra-Modality Alignment through Transport Attention). Through MATA coupled with the combination of MFMs: LanguageBind…
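For readers unfamiliar with optimal transport as a fusion tool, the general idea can be sketched in a few lines. The example below is not the MATA framework from the paper, whose details are not given in the abstract; it is a generic entropic optimal transport (Sinkhorn) alignment between two small sets of feature vectors, followed by concatenation. The function names, the toy vectors `FEATS_A` and `FEATS_B`, the squared Euclidean cost, and the regularization value are all assumptions made for illustration.

```python
import math

# Hypothetical toy features standing in for two foundation-model outputs.
FEATS_A = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
FEATS_B = [[0.1, 0.0], [1.0, 0.9], [2.1, 0.1]]


def cost_matrix(xs, ys):
    """Squared Euclidean distance between every pair of vectors."""
    return [[sum((a - b) ** 2 for a, b in zip(x, y)) for y in ys] for x in xs]


def sinkhorn(cost, reg=0.5, n_iters=200):
    """Entropic OT plan between uniform marginals via Sinkhorn iterations."""
    n, m = len(cost), len(cost[0])
    row_mass = [1.0 / n] * n
    col_mass = [1.0 / m] * m
    # Gibbs kernel: cheap pairs get large weights.
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def transport_and_fuse(feats_a, feats_b, reg=0.5):
    """Move feats_b onto feats_a with the OT plan, then concatenate."""
    plan = sinkhorn(cost_matrix(feats_a, feats_b), reg=reg)
    m = len(feats_b)
    dim_b = len(feats_b[0])
    fused = []
    for i, row in enumerate(plan):
        mass = sum(row)
        # Barycentric projection: plan-weighted average of feats_b.
        moved = [sum(row[j] * feats_b[j][d] for j in range(m)) / mass
                 for d in range(dim_b)]
        fused.append(list(feats_a[i]) + moved)
    return plan, fused
```

Calling `transport_and_fuse(FEATS_A, FEATS_B)` returns the transport plan and one fused vector per row of `FEATS_A`. A learned model would typically replace the fixed cost and the plain concatenation; both are kept simple here so the transport step stays visible.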

Taxonomy

Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis